PREFACE


O dear Ophelia! 
I am ill at these numbers: 
I have not art to reckon my groans.


— HAMLET (Act II, Scene 2, Line 120)


THE ALGORITHMS discussed in this book deal directly with numbers; yet I 
believe they are properly called seminumerical, because they lie on the borderline 
between numeric and symbolic calculation. Each algorithm not only computes 
the desired answers to a numerical problem, it also is intended to blend well 
with the internal operations of a digital computer. In many cases people are 
not able to appreciate the full beauty of such an algorithm unless they also 
have some knowledge of a computer’s machine language; the efficiency of the 
corresponding machine program is a vital factor that cannot be divorced from 
the algorithm itself. The problem is to find the best ways to make computers 
deal with numbers, and this involves tactical as well as numerical considerations. 
Therefore the subject matter of this book is unmistakably a part of computer 
science, as well as of numerical mathematics. 

Some people working in “higher levels” of numerical analysis will regard the 
topics treated here as the domain of system programmers. Other people working 
in “higher levels” of system programming will regard the topics treated here as 
the domain of numerical analysts. But I hope that there are a few people left who 
will want to look carefully at these basic methods. Although the methods reside 
perhaps on a low level, they underlie all of the more grandiose applications of 
computers to numerical problems, so it is important to know them well. We are 
concerned here with the interface between numerical mathematics and computer 
programming, and it is the mating of both types of skills that makes the subject 
so interesting. 

There is a noticeably higher percentage of mathematical material in this 
book than in other volumes of this series, because of the nature of the subjects 
treated. In most cases the necessary mathematical topics are developed here 
starting almost from scratch (or from results proved in Volume 1), but in several 
easily recognizable sections a knowledge of calculus has been assumed. 

This volume comprises Chapters 3 and 4 of the complete series. Chapter 3 
is concerned with “random numbers”: It is not only a study of various ways to 
generate random sequences, it also investigates statistical tests for randomness, 
as well as the transformation of uniform random numbers into other types of
random quantities; the latter subject illustrates how random numbers are used 
in practice. I have also included a section about the nature of randomness 
itself. Chapter 4 is my attempt to tell the fascinating story of what people 
have discovered about the processes of arithmetic, after centuries of progress. It 
discusses various systems for representing numbers, and how to convert between 
them; and it treats arithmetic on floating point numbers, high-precision integers, 
rational fractions, polynomials, and power series, including the questions of 
factoring and finding greatest common divisors. 

Each of Chapters 3 and 4 can be used as the basis of a one-semester college 
course at the junior to graduate level. Although courses on “Random Numbers” 
and on “Arithmetic” are not presently a part of many college curricula, I believe
the reader will find that the subject matter of these chapters lends itself
nicely to a unified treatment of material that has real educational value. My 
own experience has been that these courses are a good means of introducing 
elementary probability theory and number theory to college students. Nearly 
all of the topics usually treated in such introductory courses arise naturally 
in connection with applications, and the presence of these applications can be 
an important motivation that helps the student to learn and to appreciate the 
theory. Furthermore, each chapter gives a few hints of more advanced topics 
that will whet the appetite of many students for further mathematical study. 

For the most part this book is self-contained, except for occasional discussions
relating to the MIX computer explained in Volume 1. Appendix B contains a
summary of the mathematical notations used, some of which are a little different 
from those found in traditional mathematics books. 


Preface to the Third Edition 


When the second edition of this book was completed in 1980, it represented the 
first major test case for prototype systems of electronic publishing called TeX
and METAFONT. I am now pleased to celebrate the full development of those 
systems by returning to the book that inspired and shaped them. At last I am 
able to have all volumes of The Art of Computer Programming in a consistent 
format that will make them readily adaptable to future changes in printing and 
display technology. The new setup has allowed me to make many thousands of 
improvements that I have been wanting to incorporate for a long time. 

In this new edition I have gone over every word of the text, trying to retain 
the youthful exuberance of my original sentences while perhaps adding some 
more mature judgment. Dozens of new exercises have been added; dozens of 
old exercises have been given new and improved answers. Changes appear everywhere,
but most significantly in Sections 3.5 (about theoretical guarantees of
randomness), 3.6 (about portable random-number generators), 4.5.2 (about the 
binary gcd algorithm), and 4.7 (about composition and iteration of power series). 

The Art of Computer Programming is, however, still a work in progress.
Research on seminumerical algorithms continues to grow at a phenomenal
rate. Therefore some parts of this book are headed by an “under construction” 
icon, to apologize for the fact that the material is not up-to-date. My files are 
bursting with important material that I plan to include in the final, glorious, 
fourth edition of Volume 2, perhaps 16 years from now; but I must finish 
Volumes 4 and 5 first, and I do not want to delay their publication any more 
than absolutely necessary. 


I am enormously grateful to the many hundreds of people who have helped
me to gather and refine this material during the past 35 years. Most of the hard 
work of preparing the new edition was accomplished by Silvio Levy, who expertly 
edited the electronic text, and by Jeffrey Oldham, who converted nearly all of 
the original illustrations to METAPOST format. I have corrected every error that 
alert readers detected in the second edition (as well as some mistakes that, alas, 
nobody noticed); and I have tried to avoid introducing new errors in the new 
material. However, I suppose some defects still remain, and I want to fix them 
as soon as possible. Therefore I will cheerfully award $2.56 to the first finder of 
each technical, typographical, or historical error. The webpage cited on page iv 
contains a current listing of all corrections that have been reported to me. 


Stanford, California D. E. K. 
July 1997 


When a book has been eight years in the making, 

there are too many colleagues, typists, students, 

teachers, and friends to thank. 

Besides, I have no intention of giving such people 

the usual exoneration from responsibility for errors which remain. 
They should have corrected me! 

And sometimes they are even responsible for ideas 

which may turn out in the long run to be wrong. 

Anyway, to such fellow explorers, my thanks. 


— EDWARD F. CAMPBELL, JR. (1975) 


‘Defendit numerus,’ [there is safety in numbers] 
is the maxim of the foolish; 

‘Deperdit numerus,’ [there is ruin in numbers] 
of the wise. 


— C. C. COLTON (1820) 
NOTES ON THE EXERCISES


THE EXERCISES in this set of books have been designed for self-study as well as 
for classroom study. It is difficult, if not impossible, for anyone to learn a subject 
purely by reading about it, without applying the information to specific problems 
and thereby being encouraged to think about what has been read. Furthermore, 
we all learn best the things that we have discovered for ourselves. Therefore the 
exercises form a major part of this work; a definite attempt has been made to 
keep them as informative as possible and to select problems that are enjoyable 
as well as instructive. 

In many books, easy exercises are found mixed randomly among extremely 
difficult ones. A motley mixture is, however, often unfortunate because readers 
like to know in advance how long a problem ought to take — otherwise they 
may just skip over all the problems. A classic example of such a situation is 
the book Dynamic Programming by Richard Bellman; this is an important, 
pioneering work in which a group of problems is collected together at the end 
of some chapters under the heading “Exercises and Research Problems,” with 
extremely trivial questions appearing in the midst of deep, unsolved problems. 
It is rumored that someone once asked Dr. Bellman how to tell the exercises 
apart from the research problems, and he replied, “If you can solve it, it is an 
exercise; otherwise it’s a research problem.” 

Good arguments can be made for including both research problems and 
very easy exercises in a book of this kind; therefore, to save the reader from 
the possible dilemma of determining which are which, rating numbers have been 
provided to indicate the level of difficulty. These numbers have the following 
general significance: 


Rating Interpretation 


00 An extremely easy exercise that can be answered immediately if the 
material of the text has been understood; such an exercise can almost 
always be worked “in your head.” 


10 A simple problem that makes you think over the material just read, but 
is by no means difficult. You should be able to do this in one minute at 
most; pencil and paper may be useful in obtaining the solution. 


20 An average problem that tests basic understanding of the text material,
but you may need about fifteen or twenty minutes to answer it
completely.


30 A problem of moderate difficulty and/or complexity; this one may
involve more than two hours’ work to solve satisfactorily, or even more 
if the TV is on. 


40 Quite a difficult or lengthy problem that would be suitable for a term 
project in classroom situations. A student should be able to solve the 
problem in a reasonable amount of time, but the solution is not trivial. 


50 A research problem that has not yet been solved satisfactorily, as far 
as the author knew at the time of writing, although many people have 
tried. If you have found an answer to such a problem, you ought to 
write it up for publication; furthermore, the author of this book would 
appreciate hearing about the solution as soon as possible (provided that 
it is correct). 


By interpolation in this “logarithmic” scale, the significance of other rating 
numbers becomes clear. For example, a rating of 17 would indicate an exercise 
that is a bit simpler than average. Problems with a rating of 50 that are 
subsequently solved by some reader may appear with a 40 rating in later editions 
of the book, and in the errata posted on the Internet (see page iv). 

The remainder of the rating number divided by 5 indicates the amount of 
detailed work required. Thus, an exercise rated 24 may take longer to solve 
than an exercise that is rated 25, but the latter will require more creativity. All 
exercises with ratings of 46 or more are open problems for future research, rated 
according to the number of different attacks that they’ve resisted so far. 

The author has tried earnestly to assign accurate rating numbers, but it is 
difficult for the person who makes up a problem to know just how formidable it 
will be for someone else to find a solution; and everyone has more aptitude for 
certain types of problems than for others. It is hoped that the rating numbers 
represent a good guess at the level of difficulty, but they should be taken as 
general guidelines, not as absolute indicators. 

This book has been written for readers with varying degrees of mathematical 
training and sophistication; as a result, some of the exercises are intended only for 
the use of more mathematically inclined readers. The rating is preceded by an M 
if the exercise involves mathematical concepts or motivation to a greater extent 
than necessary for someone who is primarily interested only in programming 
the algorithms themselves. An exercise is marked with the letters “HM” if its 
solution necessarily involves a knowledge of calculus or other higher mathematics 
not developed in this book. An “HM” designation does not necessarily imply 
difficulty. 

Some exercises are preceded by an arrowhead, “>”; this designates problems
that are especially instructive and especially recommended. Of course, no
reader/student is expected to work all of the exercises, so those that seem to 
be the most valuable have been singled out. (This distinction is not meant to 
detract from the other exercises!) Each reader should at least make an attempt 
to solve all of the problems whose rating is 10 or less; and the arrows may help 
to indicate which of the problems with a higher rating should be given priority. 
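
The rating conventions described above can be decoded mechanically. The following Python sketch is only an illustration: the function name decode_rating and the dictionary keys are hypothetical, and the "lower_anchor" field simply reports the nearest tabulated level at or below the number, leaving the interpolation between levels to the reader.

```python
import re

def decode_rating(code: str) -> dict:
    """Split a rating such as 'M34' or 'HM46' into the parts the text describes."""
    match = re.fullmatch(r"(HM|M)?(\d\d)", code)
    if match is None:
        raise ValueError(f"not a rating code: {code!r}")
    prefix = match.group(1) or ""          # '', 'M', or 'HM'
    number = int(match.group(2))           # the two-digit rating number
    return {
        "prefix": prefix,
        "number": number,
        # Nearest tabulated level (00, 10, ..., 50) at or below the number.
        "lower_anchor": min(number // 10 * 10, 50),
        # Remainder mod 5: the amount of detailed work, per the text.
        "detail": number % 5,
        # Ratings of 46 or more mark open research problems, per the text.
        "open_problem": number >= 46,
    }

print(decode_rating("M34"))
```

For instance, decode_rating("24")["detail"] is 4 while decode_rating("25")["detail"] is 0, matching the remark that an exercise rated 24 may take longer than one rated 25.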

Solutions to most of the exercises appear in the answer section. Please use
them wisely; do not turn to the answer until you have made a genuine effort to 
solve the problem by yourself, or unless you absolutely do not have time to work 
this particular problem. After getting your own solution or giving the problem a 
decent try, you may find the answer instructive and helpful. The solution given 
will often be quite short, and it will sketch the details under the assumption 
that you have earnestly tried to solve it by your own means first. Sometimes the 
solution gives less information than was asked; often it gives more. It is quite 
possible that you may have a better answer than the one published here, or you 
may have found an error in the published solution; in such a case, the author 
will be pleased to know the details. Later printings of this book will give the 
improved solutions together with the solver’s name where appropriate. 

When working an exercise you may generally use the answers to previous 
exercises, unless specifically forbidden from doing so. The rating numbers have 
been assigned with this in mind; thus it is possible for exercise n + 1 to have a 
lower rating than exercise n, even though it includes the result of exercise n as 
a special case. 


Summary of codes:               00  Immediate
                                10  Simple (one minute)
                                20  Medium (quarter hour)
>   Recommended                 30  Moderately hard
M   Mathematically oriented     40  Term project
HM  Requiring “higher math”     50  Research problem


EXERCISES


1. [00] What does the rating “M20” mean? 

2. [10] Of what value can the exercises in a textbook be to the reader? 

3. [M34] Leonhard Euler conjectured in 1772 that the equation $w^4 + x^4 + y^4 = z^4$
has no solution in positive integers, but Noam Elkies proved in 1987 that infinitely
many solutions exist [see Math. Comp. 51 (1988), 825-835]. Find all integer solutions
such that $0 \le w \le x \le y < z < 10^6$.

4. [M50] Prove that when n is an integer, n > 4, the equation $w^n + x^n + y^n = z^n$
has no solution in positive integers w, x, y, z.



Exercise is the beste instrument in learnyng. 
— ROBERT RECORDE, The Whetstone of Witte (1557) 
CHAPTER THREE


RANDOM NUMBERS 


Any one who considers arithmetical 
methods of producing random digits 
is, of course, in a state of sin. 


— JOHN VON NEUMANN (1951) 


Lest men suspect your tale untrue, 
Keep probability in view. 


— JOHN GAY (1727) 


There wanted not some beams of light 
to guide men in the exercise of their Stocastick faculty. 


— JOHN OWEN (1662) 


3.1. INTRODUCTION 


NUMBERS that are “chosen at random” are useful in many different kinds of 
applications. For example: 


a) Simulation. When a computer is being used to simulate natural phenomena, 
random numbers are required to make things realistic. Simulation covers many 
fields, from the study of nuclear physics (where particles are subject to random 
collisions) to operations research (where people come into, say, an airport at 
random intervals). 

b) Sampling. It is often impractical to examine all possible cases, but a random 
sample will provide insight into what constitutes “typical” behavior. 


c) Numerical analysis. Ingenious techniques for solving complicated numerical 
problems have been devised using random numbers. Several books have been 
written on this subject. 


d) Computer programming. Random values make a good source of data for 
testing the effectiveness of computer algorithms. More importantly, they are 
crucial to the operation of randomized algorithms, which are often far superior 
to their deterministic counterparts. This use of random numbers is the primary 
application of interest to us in this series of books; it accounts for the fact that 
random numbers are already being considered here in Chapter 3, before most of 
the other computer algorithms have appeared. 

e) Decision making. There are reports that many executives make their decisions
by flipping a coin or by throwing darts, etc. It is also rumored that some
college professors prepare their grades on such a basis. Sometimes it is important 
to make a completely “unbiased” decision. Randomness is also an essential part 
of optimal strategies in the theory of matrix games. 


f) Cryptography. A source of unbiased bits is crucial for many types of secure 
communications, when data needs to be concealed. 

g) Aesthetics. A little bit of randomness makes computer-generated graphics 
and music seem more lively. For example, a pattern like

[figure: first pattern, not reproduced]

tends to look more appealing than

[figure: second pattern, not reproduced]

in certain contexts. [See D. E. Knuth, Bull. Amer. Math. Soc. 1 (1979), 369.]


h) Recreation. Rolling dice, shuffling decks of cards, spinning roulette wheels, 
etc., are fascinating pastimes for just about everybody. These traditional uses 
of random numbers have suggested the name “Monte Carlo method,” a general 
term used to describe any algorithm that employs random numbers. 


People who think about this topic almost invariably get into philosophical 
discussions about what the word “random” means. In a sense, there is no such 
thing as a random number; for example, is 2 a random number? Rather, we speak
of a sequence of independent random numbers with a specified distribution, and 
this means loosely that each number was obtained merely by chance, having 
nothing to do with other numbers of the sequence, and that each number has a 
specified probability of falling in any given range of values. 

A uniform distribution on a finite set of numbers is one in which each possible 
number is equally probable. A distribution is generally understood to be uniform 
unless some other distribution is specifically mentioned. 

Each of the ten digits 0 through 9 will occur about 1/10 of the time in a
(uniform) sequence of random digits. Each pair of two successive digits should
occur about 1/100 of the time, and so on. Yet if we take a truly random sequence
of a million digits, it will not always have exactly 100,000 zeros, 100,000 ones, 
etc. In fact, chances of this are quite slim; a sequence of such sequences will have 
this character on the average. 
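
The point about digit frequencies can be observed with a small experiment. This Python sketch is an illustration only: it draws digits from Python's built-in generator (itself a deterministic method, so the digits merely appear random), and the function name digit_counts and the seed value are arbitrary choices, not from the text.

```python
import random
from collections import Counter

def digit_counts(n: int, seed: int = 2024) -> Counter:
    """Tally n decimal digits drawn uniformly from 0..9."""
    rng = random.Random(seed)  # seeded only so that a run can be repeated
    return Counter(rng.randrange(10) for _ in range(n))

n = 1_000_000
counts = digit_counts(n)
for digit in range(10):
    # Each relative frequency should be near 1/10, but the counts
    # need not be exactly n/10.
    print(digit, counts[digit], counts[digit] / n)
```

Each printed frequency should be close to 1/10, while the individual counts will usually differ somewhat from 100,000.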

Any specified sequence of a million digits is as probable as any other. Thus, 
if we are choosing a million digits at random and if the first 999,999 of them 
happen to come out to be zero, the chance that the final digit is zero is still 
exactly 1/10 in a truly random situation. These statements seem paradoxical to
many people, yet no contradiction is really involved. 

There are several ways to formulate decent abstract definitions of randomness,
and we will return to this interesting subject in Section 3.5; but for the
moment, let us content ourselves with an intuitive understanding of the concept. 


Many years ago, people who needed random numbers in their scientific work 
would draw balls out of a “well-stirred urn,” or they would roll dice or deal out 
cards. A table of over 40,000 random digits, “taken at random from census
reports,” was published in 1927 by L. H. C. Tippett. Since then, a number of 
devices have been built to generate random numbers mechanically. The first such 
machine was used in 1939 by M. G. Kendall and B. Babington-Smith to produce 
a table of 100,000 random digits. The Ferranti Mark I computer, first installed 
in 1951, had a built-in instruction that put 20 random bits into the accumulator 
using a resistance noise generator; this feature had been recommended by A. M. 
Turing. In 1955, the RAND Corporation published a widely used table of a 
million random digits obtained with the help of another special device. A famous 
random-number machine called ERNIE has been used for many years to pick the 
winning numbers in the British Premium Savings Bonds lottery. [F. N. David describes
the early history in Games, Gods, and Gambling (1962). See also Kendall
and Babington-Smith, J. Royal Stat. Soc. A101 (1938), 147-166; B6 (1939), 51-61;
S. H. Lavington’s discussion of the Mark I in CACM 21 (1978), 4-12; the
review of the RAND table in Math. Comp. 10 (1956), 39-43; and the discussion 
of ERNIE by W. E. Thomson, J. Royal Stat. Soc. A122 (1959), 301-333.] 

Shortly after computers were introduced, people began to search for efficient 
ways to obtain random numbers within computer programs. A table could be 
used, but this method is of limited utility because of the memory space and 
input time requirement, because the table may be too short, and because it 
is a bit of a nuisance to prepare and maintain the table. A machine such as 
ERNIE might be attached to the computer, as in the Ferranti Mark I, but this 
has proved to be unsatisfactory since it is impossible to reproduce calculations 
exactly a second time when checking out a program; moreover, such machines 
have tended to suffer from malfunctions that are extremely difficult to detect. 
Advances in technology made tables useful again during the 1990s, because 
a billion well-tested random bytes could easily be made accessible. George 
Marsaglia helped resuscitate random tables in 1995 by preparing a demonstration 
disk that contained 650 random megabytes, generated by combining the output 
of a noise-diode circuit with deterministically scrambled rap music. (He called 
it “white and black noise.”) 

The inadequacy of mechanical methods in the early days led to an interest 
in the production of random numbers using a computer’s ordinary arithmetic 
operations. John von Neumann first suggested this approach in about 1946; 
his idea was to take the square of the previous random number and to extract 
the middle digits. For example, if we are generating 10-digit numbers and the 
previous value was 5772156649, we square it to get 


33317792380594909201; 


the next number is therefore 7923805949. 
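
The middle-square step just described can be sketched in a few lines of Python. This is a minimal illustration, assuming an even number of digits; the function name middle_square and its digits parameter are hypothetical names introduced here.

```python
def middle_square(x: int, digits: int = 10) -> int:
    """Return the middle `digits` digits of x*x.

    The square is treated as a number of 2*digits digits (with leading
    zeros if necessary); `digits` is assumed to be even.
    """
    half = digits // 2
    # Drop the low-order `half` digits, then keep the next `digits` digits.
    return (x * x // 10**half) % 10**digits

print(middle_square(5772156649))  # 7923805949, as in the example above
```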

There is a fairly obvious objection to this technique: How can a sequence 
generated in such a way be random, since each number is completely determined 
by its predecessor? (See von Neumann’s comment at the beginning of this 
chapter.) The answer is that the sequence isn’t random, but it appears to 
be. In typical applications the actual relationship between one number and 
its successor has no physical significance; hence the nonrandom character is
not really undesirable. Intuitively, the middle square seems to be a fairly good 
scrambling of the previous number. 

Sequences generated in a deterministic way such as this are often called 
pseudorandom or quasirandom sequences in the highbrow technical literature, 
but in most places of this book we shall simply call them random sequences, 
with the understanding that they only appear to be random. Being “apparently 
random” is perhaps all that can be said about any random sequence anyway. 
Random numbers generated deterministically on computers have worked quite 
well in nearly every application, provided that a suitable method has been 
carefully selected. Of course, deterministic sequences aren’t always the answer; 
they certainly shouldn’t replace ERNIE for the lotteries. 

Von Neumann’s original “middle-square method” has actually proved to be a
comparatively poor source of random numbers. The danger is that the sequence 
tends to get into a rut, a short cycle of repeating elements. For example, if zero 
ever appears as a number of the sequence, it will continually perpetuate itself. 

Several people experimented with the middle-square method in the early 
1950s. Working with numbers that have four digits instead of ten, G. E. Forsythe 
tried 16 different starting values and found that 12 of them led to sequences 
ending with the cycle 6100, 2100, 4100, 8100, 6100, ..., while two of them 
degenerated to zero. More extensive tests were carried out by N. Metropolis, 
mostly in the binary number system. He showed that when 20-bit numbers are 
being used, there are 13 different cycles into which the middle-square sequence 
might degenerate, the longest of which has a period of length 142. 
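
The four-digit cycle mentioned above is easy to reproduce. The following Python sketch is an illustration under stated assumptions: the names middle_square and find_cycle are hypothetical, and the approach simply remembers every value seen, so it uses far more memory than a careful cycle-detection method would.

```python
def middle_square(x: int, digits: int = 4) -> int:
    """Middle `digits` digits of x*x (digits assumed even)."""
    return (x * x // 10**(digits // 2)) % 10**digits

def find_cycle(seed: int, digits: int = 4):
    """Iterate the middle-square method until a value repeats.

    Returns (tail, cycle): the values before the cycle is entered,
    and the values of the cycle itself, in order.
    """
    first_seen = {}   # value -> index at which it first appeared
    sequence = []
    x = seed
    while x not in first_seen:
        first_seen[x] = len(sequence)
        sequence.append(x)
        x = middle_square(x, digits)
    start = first_seen[x]
    return sequence[:start], sequence[start:]

tail, cycle = find_cycle(6100)
print(cycle)  # [6100, 2100, 4100, 8100]
```

Calling find_cycle on other four-digit starting values shows how quickly a given seed falls into a short cycle or degenerates to zero.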

It is fairly easy to restart the middle-square method on a new value when 
zero has been detected, but long cycles are somewhat harder to avoid. Exercises 6 
and 7 discuss some interesting ways to determine the cycles of periodic sequences, 
using very little memory space. 

A theoretical disadvantage of the middle-square method is given in exercises 
9 and 10. On the other hand, working with 38-bit numbers, Metropolis obtained 
a sequence of about 750,000 numbers before degeneracy occurred, and the resulting
750,000 × 38 bits satisfactorily passed statistical tests for randomness.
[Symp. on Monte Carlo Methods (Wiley, 1956), 29-36.] This experience showed 
that the middle-square method can give usable results, but it is rather dangerous 
to put much faith in it until after elaborate computations have been performed. 

Many random number generators in use when this chapter was first written 
were not very good. People have traditionally tended to avoid learning about 
such subroutines; old methods that were comparatively unsatisfactory have been 
passed down blindly from one programmer to another, until the users have no 
understanding of the original limitations. We shall see in this chapter that the 
most important facts about random number generators are not difficult to learn, 
although prudence is necessary to avoid common pitfalls. 

It is not easy to invent a foolproof source of random numbers. This fact was 
convincingly impressed upon the author in 1959, when he attempted to create a 
fantastically good generator using the following peculiar approach: 

Algorithm K (“Super-random” number generator). Given a 10-digit decimal
number X, this algorithm may be used to change X to the number that should 
come next in a supposedly random sequence. Although the algorithm might be 
expected to yield quite a random sequence, reasons given below show that it 
is not, in fact, very good at all. (The reader need not study this algorithm in 
great detail except to observe how complicated it is; note, in particular, steps 
K1 and K2.) 

K1. [Choose number of iterations.] Set $Y \leftarrow \lfloor X/10^9 \rfloor$, the most significant
digit of X. (We will execute steps K2 through K13 exactly Y + 1 times;
that is, we will apply randomizing transformations a random number of
times.)

K2. [Choose random step.] Set $Z \leftarrow \lfloor X/10^8 \rfloor \bmod 10$, the second most
significant digit of X. Go to step K(3 + Z). (That is, we now jump to a random
step in the program.)

K3. [Ensure $\ge 5 \times 10^9$.] If X < 5000000000, set $X \leftarrow X + 5000000000$.

K4. [Middle square.] Replace X by $\lfloor X^2/10^5 \rfloor \bmod 10^{10}$, that is, by the middle
of the square of X.

K5. [Multiply.] Replace X by $(1001001001\,X) \bmod 10^{10}$.

K6. [Pseudo-complement.] If X < 100000000, then set $X \leftarrow X + 9814055677$;
otherwise set $X \leftarrow 10^{10} - X$.

K7. [Interchange halves.] Interchange the low-order five digits of X with the
high-order five digits; that is, set $X \leftarrow 10^5 (X \bmod 10^5) + \lfloor X/10^5 \rfloor$, the
middle 10 digits of $(10^{10} + 1)X$.

K8. [Multiply.] Same as step K5.

K9. [Decrease digits.] Decrease each nonzero digit of the decimal representation
of X by one.

K10. [99999 modify.] If $X < 10^5$, set $X \leftarrow X^2 + 99999$; otherwise set
$X \leftarrow X - 99999$.

K11. [Normalize.] (At this point X cannot be zero.) If $X < 10^9$, set $X \leftarrow 10X$
and repeat this step.

K12. [Modified middle square.] Replace X by $\lfloor X(X - 1)/10^5 \rfloor \bmod 10^{10}$, that
is, by the middle 10 digits of $X(X - 1)$.

K13. [Repeat?] If Y > 0, decrease Y by 1 and return to step K2. If Y = 0, the
algorithm terminates with X as the desired “random” value.


(The machine-language program corresponding to this algorithm was intended 
to be so complicated that a person reading a listing of it without explanatory 
comments wouldn’t know what the program was doing.) 

Considering all the contortions of Algorithm K, doesn’t it seem plausible that
it should produce almost an infinite supply of unbelievably random numbers?
No! In fact, when this algorithm was first put onto a computer, it almost
immediately converged to the 10-digit value 6065038420, which, by extraordinary
coincidence, is transformed into itself by the algorithm (see Table 1). With
another starting number, the sequence began to repeat after 7401 values, in a
cyclic period of length 3178.


Table 1

A COLOSSAL COINCIDENCE: THE NUMBER 6065038420
IS TRANSFORMED INTO ITSELF BY ALGORITHM K.

(Read down each column, from left to right.)

Step X (after)          Step X (after)          Step X (after)
K1   6065038420         K10  3830951656         K9   1273515517
K3   6065038420         K11  3830951656         K10  1273415518
K4   6910360760         K12  1905867781  Y=5    K11  1273415518
K5   8031120760         K12  3319967479  Y=4    K12  5870802097  Y=2
K6   1968879240         K6   6680032521         K11  5870802097
K7   7924019688         K7   3252166800         K12  3172562687  Y=1
K8   9631707688         K8   2218966800         K4   1540029446
K9   8520606577         K9   1107855700         K5   7015475446
K10  8520506578         K10  1107755701         K6   2984524554
K11  8520506578         K11  1107755701         K7   2455429845
K12  0323372207  Y=6    K12  1226919902  Y=3    K8   2730274845
K6   9676627793         K5   0048821902         K9   1620163734
K7   2779396766         K6   9862877579         K10  1620063735
K8   4942162766         K7   7757998628         K11  1620063735
K9   3831051655         K8   2384626628         K12  6065038420  Y=0
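
Algorithm K is intricate enough that a direct transcription helps in checking the coincidence of Table 1. The Python sketch below follows steps K1 through K13 as stated in the text; the function name algorithm_k is hypothetical, and handling step K9 through the decimal string of X is an implementation choice made here for brevity.

```python
def algorithm_k(x: int) -> int:
    """Apply Algorithm K once to a 10-digit decimal number x (0 <= x < 10**10)."""
    m = 10**10
    y = x // 10**9                          # K1: most significant digit
    while True:
        z = (x // 10**8) % 10               # K2: second most significant digit
        step = 3 + z                        # jump to step K(3 + z), then fall through
        if step <= 3 and x < 5_000_000_000:  # K3: ensure x >= 5 * 10**9
            x += 5_000_000_000
        if step <= 4:                       # K4: middle square
            x = (x * x // 10**5) % m
        if step <= 5:                       # K5: multiply
            x = (1001001001 * x) % m
        if step <= 6:                       # K6: pseudo-complement
            x = x + 9814055677 if x < 100_000_000 else m - x
        if step <= 7:                       # K7: interchange halves
            x = 10**5 * (x % 10**5) + x // 10**5
        if step <= 8:                       # K8: same as K5
            x = (1001001001 * x) % m
        if step <= 9:                       # K9: decrease each nonzero digit by one
            x = int("".join(str(max(int(d) - 1, 0)) for d in str(x)))
        if step <= 10:                      # K10: 99999 modify
            x = x * x + 99999 if x < 10**5 else x - 99999
        if step <= 11:                      # K11: normalize (x is nonzero here)
            while x < 10**9:
                x *= 10
        x = (x * (x - 1) // 10**5) % m      # K12: modified middle square
        if y == 0:                          # K13: repeat?
            return x
        y -= 1

x = 6065038420
print(algorithm_k(x) == x)  # Table 1 asserts that this value maps to itself
```

Printing x after each step inside the loop gives a trace that can be compared with Table 1 row by row.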

The moral of this story is that random numbers should not be generated 
with a method chosen at random. Some theory should be used. 


In the following sections we shall consider random number generators that 
are superior to the middle-square method and to Algorithm K. The corresponding
sequences are guaranteed to have certain desirable random properties, and
no degeneracy will occur. We shall explore the reasons for this random-like 
behavior in some detail, and we shall also consider techniques for manipulating 
random numbers. For example, one of our investigations will be the shuffling of 
a simulated deck of cards within a computer program. 

Section 3.6 summarizes this chapter and lists several bibliographic sources. 


EXERCISES 


> 1. [20] Suppose that you wish to obtain a decimal digit at random, not using a
computer. Which of the following methods would be suitable?

a) Open a telephone directory to a random place by sticking your finger in it
somewhere, and use the units digit of the first number found on the selected page.

b) Same as (a), but use the units digit of the page number.

c) Roll a die that is in the shape of a regular icosahedron, whose twenty faces have
been labeled with the digits 0, 0, 1, 1, ..., 9, 9. Use the digit that appears on
top, when the die comes to rest. (A felt-covered table with a hard surface is
recommended for rolling dice.)

d) Expose a geiger counter to a source of radioactivity for one minute (shielding
yourself) and use the units digit of the resulting count. Assume that the geiger
counter displays the number of counts in decimal notation, and that the count is
initially zero.

e) Glance at your wristwatch; and if the position of the second-hand is between 6n
and 6(n + 1) seconds, choose the digit n.

f) Ask a friend to think of a random digit, and use the digit he names.

g) Ask an enemy to think of a random digit, and use the digit he names.

h) Assume that 10 horses are entered in a race and that you know nothing whatever
about their qualifications. Assign to these horses the digits 0 to 9, in arbitrary
fashion, and after the race use the winner’s digit.


2. [M22] In a random sequence of a million decimal digits, what is the probability
that there are exactly 100,000 of each possible digit?


3. [10] What number follows 1010101010 in the middle-square method?


4. [20] (a) Why can’t the value of X be zero when step K11 of Algorithm K is
performed? What would be wrong with the algorithm if X could be zero? (b) Use
Table 1 to deduce what happens when Algorithm K is applied repeatedly with the
starting value X = 3830951656.


5. [15] Explain why, in any case, Algorithm K should not be expected to provide
infinitely many random numbers, in the sense that (even if the coincidence given in
Table 1 had not occurred) one knows in advance that any sequence generated by
Algorithm K will eventually be periodic.


> 6. [M21] Suppose that we want to generate a sequence of integers $X_0, X_1, X_2, \ldots$
in the range $0 \le X_n < m$. Let $f(x)$ be any function such that $0 \le x < m$ implies
$0 \le f(x) < m$. Consider a sequence formed by the rule $X_{n+1} = f(X_n)$. (Examples are
the middle-square method and Algorithm K.)

a) Show that the sequence is ultimately periodic, in the sense that there exist numbers
$\lambda$ and $\mu$ for which the values

    $$X_0,\ X_1,\ \ldots,\ X_\mu,\ \ldots,\ X_{\mu+\lambda-1}$$

are distinct, but $X_{n+\lambda} = X_n$ when $n \ge \mu$. Find the maximum and minimum
possible values of $\mu$ and $\lambda$.

b) (R. W. Floyd.) Show that there exists an $n > 0$ such that $X_n = X_{2n}$; and the
smallest such value of n lies in the range $\mu \le n \le \mu + \lambda$. Furthermore the value of
$X_n$ is unique in the sense that if $X_n = X_{2n}$ and $X_r = X_{2r}$, then $X_r = X_n$.

c) Use the idea of part (b) to design an algorithm that calculates $\mu$ and $\lambda$ for any
given function f and any given $X_0$, using only $O(\mu + \lambda)$ steps and only a bounded
number of memory locations.
> 7. [M21] (R. P. Brent, 1977.) Let $\ell(n)$ be the greatest power of 2 that is less than
or equal to n; thus, for example, $\ell(15) = 8$ and $\ell(\ell(n)) = \ell(n)$.


a) Show that, in terms of the notation in exercise 6, there exists an $n > 0$ such that
$X_n = X_{\ell(n)-1}$. Find a formula that expresses the least such n in terms of the
periodicity numbers $\mu$ and $\lambda$.

b) Apply this result to design an algorithm that can be used in conjunction with any
random number generator of the type $X_{n+1} = f(X_n)$, to prevent it from cycling
indefinitely. Your algorithm should calculate the period length $\lambda$, and it should
use only a small amount of memory space; you must not simply store all of the
computed sequence values!


8. [23] Make a complete examination of the middle-square method in the case of
two-digit decimal numbers.

a) We might start the process out with any of the 100 possible values 00, 01, ..., 
99. How many of these values lead ultimately to the repeating cycle 00, 00, ...? 
[Example: Starting with 43, we obtain the sequence 43, 84, 05, 02, 00, 00, 00, ....] 

b) How many possible final cycles are there? How long is the longest cycle? 

c) What starting value or values will give the largest number of distinct elements 
before the sequence repeats? 


9. [M14] Prove that the middle-square method using 2n-digit numbers to the base b 
has the following disadvantage: If the sequence includes any number whose most 
significant n digits are zero, the succeeding numbers will get smaller and smaller until 
zero occurs repeatedly. 


10. [M16] Under the assumptions of the preceding exercise, what can you say about 
the sequence of numbers following X if the least significant n digits of X are zero? 
What if the least significant n + 1 digits are zero? 


> 11. [M26] Consider sequences of random number generators having the form
described in exercise 6. If we choose $f(x)$ and $X_0$ at random (in other words, if we
assume that each of the $m^m$ possible functions $f(x)$ is equally probable and that
each of the m possible values of $X_0$ is equally probable), what is the probability
that the sequence will eventually degenerate into a cycle of length $\lambda = 1$? [Note:
The assumptions of this problem give a natural way to think of a “random” random
number generator of this type. A method such as Algorithm K may be expected to
behave somewhat like the generator considered here; the answer to this problem gives
a measure of how colossal the coincidence of Table 1 really is.]


> 12. [M31] Under the assumptions of the preceding exercise, what is the average length
of the final cycle? What is the average length of the sequence before it begins to cycle?
(In the notation of exercise 6, we wish to examine the average values of $\lambda$ and of $\mu + \lambda$.)


13. [M42] If f(x) is chosen at random in the sense of exercise 11, what is the average 
length of the longest cycle obtainable by varying the starting value Xo? [Note: We 
have already considered the analogous problem in the case that f(x) is a random 
permutation; see exercise 1.3.3-23.] 


14. [M38] If f(x) is chosen at random in the sense of exercise 11, what is the av- 
erage number of distinct final cycles obtainable by varying the starting value? [See 
exercise 8(b).] 


15. [M15] If f(x) is chosen at random in the sense of exercise 11, what is the proba- 
bility that none of the final cycles has length 1, regardless of the choice of Xo? 
16. [15] A sequence generated as in exercise 6 must begin to repeat after at most m
values have been generated. Suppose we generalize the method so that $X_{n+1}$ depends
on $X_{n-1}$ as well as on $X_n$; formally, let $f(x, y)$ be a function such that $0 \le x, y < m$
implies $0 \le f(x, y) < m$. The sequence is constructed by selecting $X_0$ and $X_1$
arbitrarily, and then letting

    $$X_{n+1} = f(X_n, X_{n-1}), \qquad \text{for } n > 0.$$

What is the maximum period conceivably attainable in this case?


17. [10] Generalize the situation in the previous exercise so that $X_{n+1}$ depends on
the preceding k values of the sequence.

18. [M20] Invent a method analogous to that of exercise 7 for finding cycles in the 
general form of random number generator discussed in exercise 17. 

19. [HM47] Solve the problems of exercises 11 through 15 asymptotically for the more
general case that $X_{n+1}$ depends on the preceding k values of the sequence; each of the
$m^{m^k}$ functions $f(x_1, \ldots, x_k)$ is to be considered equally probable. [Note: The number
of functions that yield the maximum period is analyzed in exercise 2.3.4.2-23.]

20. [30] Find all nonnegative $X < 10^{10}$ that lead ultimately via Algorithm K to the
self-reproducing number in Table 1. 

21. [40] Prove or disprove: The mapping $X \mapsto f(X)$ defined by Algorithm K has
exactly five cycles, of lengths 3178, 1606, 1024, 943, and 1. 


22. [21] (H. Rolletschek.) Would it be a good idea to generate random numbers by
using the sequence f(0), f(1), f(2), ..., where f is a random function, instead of using
$x_0$, $f(x_0)$, $f(f(x_0))$, etc.?

23. [M26] (D. Foata and A. Fuchs, 1970.) Show that each of the $m^m$ functions $f(x)$
considered in exercise 6 can be represented as a sequence $(x_0, x_1, \ldots, x_{m-1})$ having the
following properties:

i) $(x_0, x_1, \ldots, x_{m-1})$ is a permutation of $(f(0), f(1), \ldots, f(m-1))$.

ii) $(f(0), \ldots, f(m-1))$ can be uniquely reconstructed from $(x_0, x_1, \ldots, x_{m-1})$.

iii) The elements that appear in cycles of f are $\{x_0, x_1, \ldots, x_{k-1}\}$, where k is the
largest subscript such that these k elements are distinct.

iv) $x_j \notin \{x_0, x_1, \ldots, x_{j-1}\}$ implies $x_{j-1} = f(x_j)$, unless $x_j$ is the smallest element in
a cycle of f.

v) $(f(0), f(1), \ldots, f(m-1))$ is a permutation of $(0, 1, \ldots, m-1)$ if and only if
$(x_0, x_1, \ldots, x_{m-1})$ represents the inverse of that permutation by the “unusual
correspondence” of Section 1.3.3.

vi) $x_0 = x_1$ if and only if $(x_1, \ldots, x_{m-1})$ represents an oriented tree by the construction
of exercise 2.3.4.4-18, with $f(x)$ the parent of x.
3.2. GENERATING UNIFORM RANDOM NUMBERS


IN THIS SECTION we shall consider methods for generating a sequence of random 
fractions — random real numbers Un, uniformly distributed between zero and one. 
Since a computer can represent a real number with only finite accuracy, we 
shall actually be generating integers Xn between zero and some number m; the 
fraction 

Un = Xn/m 


will then lie between zero and one. Usually m is the word size of the computer, 
so Xn may be regarded (conservatively) as the integer contents of a computer 
word with the radix point assumed at the extreme right, and Un may be regarded 
(liberally) as the contents of the same word with the radix point assumed at the 
extreme left. 


3.2.1. The Linear Congruential Method 


By far the most popular random number generators in use today are special 
cases of the following scheme, introduced by D. H. Lehmer in 1949. [See Proc. 
2nd Symp. on Large-Scale Digital Calculating Machinery (Cambridge, Mass.: 
Harvard University Press, 1951), 141-146.] We choose four magic integers: 


    m, the modulus;              $0 < m$.
    a, the multiplier;           $0 \le a < m$.
    c, the increment;            $0 \le c < m$.                    (1)
    $X_0$, the starting value;   $0 \le X_0 < m$.

The desired sequence of random numbers $\langle X_n \rangle$ is then obtained by setting

    $$X_{n+1} = (aX_n + c) \bmod m, \qquad n \ge 0. \tag{2}$$


This is called a linear congruential sequence. Taking the remainder mod m is 
somewhat like determining where a ball will land in a spinning roulette wheel. 
For example, the sequence obtained when m = 10 and Xo = a = c = 7 is 


7, 6, 9, 0, 7, 6, 9, 0, .... (3) 


As this example shows, the sequence is not always “random” for all choices of 
m, a, c, and Xo; the principles of choosing the magic numbers appropriately will 
be investigated carefully in later parts of this chapter. 

Example (3) illustrates the fact that the congruential sequences always get 
into a loop: There is ultimately a cycle of numbers that is repeated endlessly. 
This property is common to all sequences having the general form $X_{n+1} = f(X_n)$,
when f transforms a finite set into itself; see exercise 3.1-6. The repeating
cycle is called the period; sequence (3) has a period of length 4. A useful sequence 
will of course have a relatively long period. 
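
As an illustration of rule (2) and of the notion of period, here is a minimal
Python sketch. It is not part of the original text, and the function names `lcg`
and `period_length` are hypothetical. It reproduces sequence (3) and finds the
period by brute force, which is practical only for small m.

```python
from itertools import islice

def lcg(m, a, c, x0):
    """Yield X_0, X_1, X_2, ... where X_{n+1} = (a*X_n + c) mod m."""
    x = x0
    while True:
        yield x
        x = (a * x + c) % m

def period_length(m, a, c, x0):
    """Return (mu, lam): terms before the cycle starts, and the period length.

    Brute force: remember the index at which each value first appeared.
    """
    first_seen = {}
    for n, x in enumerate(lcg(m, a, c, x0)):
        if x in first_seen:
            return first_seen[x], n - first_seen[x]
        first_seen[x] = n

print(list(islice(lcg(10, 7, 7, 7), 8)))   # [7, 6, 9, 0, 7, 6, 9, 0]
print(period_length(10, 7, 7, 7))          # (0, 4)
```

The dictionary uses memory proportional to the number of distinct values seen;
exercises 3.1-6 and 3.1-7 ask for methods that need only bounded memory.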

The special case c = 0 deserves explicit mention, since the number generation
process is a little faster when c = 0 than it is when $c \ne 0$. We shall see later
that the restriction c = 0 cuts down the length of the period of the sequence,
but it is still possible to make the period reasonably long. Lehmer’s original
generation method had c = 0, although he mentioned $c \ne 0$ as a possibility; the
fact that $c \ne 0$ can lead to longer periods is due to Thomson [Comp. J. 1 (1958),
83, 86] and, independently, to Rotenberg [JACM 7 (1960), 75-77]. The terms
multiplicative congruential method and mixed congruential method are used by
many authors to denote linear congruential sequences with c = 0 and $c \ne 0$,
respectively.

The letters m, a, c, and Xo will be used throughout this chapter in the sense 
described above. Furthermore, we will find it useful to define 


b=a-1, (4) 


in order to simplify many of our formulas. 

We can immediately reject the case a = 1, for this would mean that
$X_n = (X_0 + nc) \bmod m$, and the sequence would certainly not behave as a random
sequence. The case a = 0 is even worse. Hence for practical purposes we may
assume that

    $$a \ge 2, \qquad b \ge 1. \tag{5}$$

Now we can prove a generalization of Eq. (2),

    $$X_{n+k} = \bigl(a^k X_n + (a^k - 1)c/b\bigr) \bmod m, \qquad k \ge 0, \quad n \ge 0, \tag{6}$$

which expresses the (n+k)th term directly in terms of the nth term. (The special
case n = 0 in this equation is worthy of note.) It follows that the subsequence
consisting of every kth term of $\langle X_n \rangle$ is another linear congruential sequence,
having the multiplier $a^k \bmod m$ and the increment $((a^k - 1)c/b) \bmod m$.
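
Equation (6) can be checked numerically. The following Python sketch is an
illustration under our own assumptions, not code from the text; the names
`lcg_step` and `lcg_jump` are hypothetical. The one subtle point is that
$(a^k - 1)/b$ is an exact integer quotient, so the division by b must be done
before reducing modulo m; working modulo bm makes that safe.

```python
def lcg_step(m, a, c, x):
    """One step of the recurrence: return (a*x + c) mod m."""
    return (a * x + c) % m

def lcg_jump(m, a, c, x, k):
    """Return X_{n+k} given X_n = x, following Eq. (6). Requires a >= 2.

    (a^k - 1)/b, with b = a - 1, is the integer 1 + a + ... + a^(k-1).
    Reducing a^k modulo b*m keeps the numbers small, and the quotient
    obtained after dividing by b is still correct modulo m.
    """
    b = a - 1
    ak = pow(a, k, b * m)                     # a^k mod (b*m)
    geometric = ((ak - 1) % (b * m)) // b     # (a^k - 1)/b, reduced mod m
    return (ak * x + geometric * c) % m

# Compare k single steps with one jump of length k.
m, a, c, x0 = 2**16, 25173, 13849, 12345      # illustrative parameters only
x = x0
for k in range(100):
    assert lcg_jump(m, a, c, x0, k) == x
    x = lcg_step(m, a, c, x)
```

The jump costs about $\log_2 k$ modular multiplications instead of k steps, which
is one practical reason the case n = 0 of (6) is worth noting.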

An important corollary of (6) is that the general sequence defined by m, a, 
c, and Xo can be expressed very simply in terms of the special case where c = 1 
and Xo = 0. Let 


Yo = 0, Yn+1 = (aYn + 1) mod m. (7) 


According to Eq. (6) we will have $Y_k \equiv (a^k - 1)/b$ (modulo m), hence the general
sequence defined in (2) satisfies

    $$X_n = (AY_n + X_0) \bmod m, \qquad \text{where } A = (X_0 b + c) \bmod m. \tag{8}$$
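
Relation (8) is easy to test by running both recurrences side by side. The
sketch below is only an illustration; the function name `check_eq8` and the
sample parameters are our own assumptions.

```python
def check_eq8(m, a, c, x0, terms=200):
    """Verify X_n = (A*Y_n + X_0) mod m for the first `terms` values of n,
    where Y_0 = 0, Y_{n+1} = (a*Y_n + 1) mod m, and A = (X_0*b + c) mod m."""
    b = a - 1
    A = (x0 * b + c) % m
    x, y = x0, 0
    for _ in range(terms):
        if x != (A * y + x0) % m:
            return False
        x = (a * x + c) % m      # general sequence, Eq. (2)
        y = (a * y + 1) % m      # special sequence, Eq. (7)
    return True

print(check_eq8(10, 7, 7, 7))    # True
```

The check succeeds for any parameters because
$A(a^n - 1)/b + X_0 = a^n X_0 + (a^n - 1)c/b$, which is Eq. (6) with n = 0.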


EXERCISES 


1. [10] Example (3) shows a situation in which X4 = Xo, so the sequence begins 
again from the beginning. Give an example of a linear congruential sequence with 
m = 10 for which Xo never appears again in the sequence. 

2. [M20] Show that if a and m are relatively prime, the number Xo will always 
appear in the period. 


3. [M10] If a and m are not relatively prime, explain why the sequence will be 
somewhat handicapped and probably not very random; hence we will generally want 
the multiplier a to be relatively prime to the modulus m. 


4. [11] Prove Eq. (6). 


5. [M20] Equation (6) holds for $k \ge 0$. If possible, give a formula that expresses
$X_{n+k}$ in terms of $X_n$ for negative values of k.
3.2.1.1. Choice of modulus. Our current goal is to find good values for the
parameters that define a linear congruential sequence. Let us first consider the 
proper choice of the number m. We want m to be rather large, since the period 
cannot have more than m elements. (Even if we intend to generate only random 
zeros and ones, we should not take m = 2, for then the sequence would at best 
have the form ...,0,1,0,1,0,1,...! Methods for getting random zeros and ones 
from linear congruential sequences are discussed in Section 3.4.) 

Another factor that influences our choice of m is speed of generation: We 
want to pick a value so that the computation of $(aX_n + c) \bmod m$ is fast.

Consider MIX as an example. We can compute y mod m by putting y in 
registers A and X and dividing by m; assuming that y and m are positive, we 
see that y mod m will then appear in register X. But division is a comparatively 
slow operation, and it can be avoided if we take m to be a value that is especially 
convenient, such as the word size of our computer. 

Let w be the computer’s word size, namely $2^e$ on an e-bit binary computer or
$10^e$ on an e-digit decimal machine. (In this book we shall often use the letter e to
denote an arbitrary integer exponent, instead of the base of natural logarithms, 
hoping that the context will make our notation unambiguous. Physicists have a 
similar problem when they use e for the charge on an electron.) The result of 
an addition operation is usually given modulo w, except on ones’-complement 
machines; and multiplication mod w is also quite simple, since the desired result 
is the lower half of the product. Thus, the following program computes the 
quantity (aX +c) mod w efficiently: 


        LDA  A        rA ← a.
        MUL  X        rAX ← (rA) · X.
        SLAX 5        rA ← rAX mod w.                    (1)
        ADD  C        rA ← (rA + c) mod w.   ▮


The result appears in register A. The overflow toggle might be on at the conclu- 
sion of these instructions; if that is undesirable, the code should be followed by, 
say, ‘JOV *+1’ to turn it off. 
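
The same idea carries over to any language with integers wider than one word.
The following Python sketch is an illustration under stated assumptions: we take
a binary word of e = 32 bits (MIX itself does not fix a binary word size), and
the names `E`, `W`, `MASK`, and `next_mod_w` are hypothetical.

```python
E = 32                 # assumed word length in bits
W = 1 << E             # word size w = 2^e
MASK = W - 1

def next_mod_w(a, x, c):
    """Return (a*x + c) mod w using only 'lower half' operations."""
    product = a * x                # double-length product, like rAX after MUL
    low = product & MASK           # lower half of the product, i.e. product mod w
    return (low + c) & MASK        # addition modulo w; the carry is discarded

x = 2024
for _ in range(5):
    x = next_mod_w(1664525, x, 1013904223)   # illustrative constants only
    print(x)
```

No division appears: masking with w − 1 plays the role of keeping the lower half
of the product, and the discarded carry corresponds to the overflow toggle.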

A clever technique that is less commonly known can be used to perform 
computations modulo w+1. For reasons to be explained later, we will generally 
want c = 0 when m = w +1, so we merely need to compute (aX) mod (w + 1). 
The following program does this: 


    01   LDAN X         rA ← −X.
    02   MUL  A         rAX ← (rA) · a.
    03   STX  TEMP
    04   SUB  TEMP      rA ← rA − rX.                    (2)
    05   JANN *+3       Exit if rA ≥ 0.
    06   INCA 2         rA ← rA + 2.
    07   ADD  =w-1=     rA ← rA + w − 1.   ▮


Register A now contains the value (aX) mod (w+ 1). Of course, this value might 
lie anywhere between 0 and w, inclusive, so the reader may legitimately wonder 
how we can represent so many values in the A-register! (The register obviously 
cannot hold a number larger than w − 1.) The answer is that the result equals w
if and only if program (2) turns overflow on, assuming that overflow was initially 
off. We could represent w by 0, since (2) will not normally be used when X = 
0; but it is most convenient simply to reject the value w if it appears in the 
congruential sequence modulo w+ 1. Then we can also avoid overflow, simply 
by changing lines 05 and 06 of (2) to ‘JANN *+4; INCA 2; JAP *-5’. 

To prove that code (2) actually does determine (aX) mod (w + 1), note that
in line 04 we are subtracting the lower half of the product from the upper half.
No overflow can occur at this step; and if $aX = qw + r$, with $0 \le r < w$, we will
have the quantity r − q in register A after line 04. Now

    $$aX = q(w + 1) + (r - q),$$

and we have $-w < r - q < w$ since $q < w$; hence (aX) mod (w + 1) equals either
r − q or r − q + (w + 1), depending on whether $r - q \ge 0$ or $r - q < 0$.

A similar technique can be used to get the product of two numbers modulo 
(w — 1); see exercise 8. 
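
The argument just given translates directly into a few lines of Python. This is
a sketch for illustration, not the MIX program itself; the function name
`mul_mod_w_plus_1` is hypothetical, and we assume $0 \le a, x < w$ so that $q < w$.

```python
def mul_mod_w_plus_1(a, x, w):
    """Return (a*x) mod (w+1) for 0 <= a, x < w, without dividing by w+1."""
    q, r = divmod(a * x, w)      # upper and lower halves of the product
    d = r - q                    # a*x = q*(w+1) + (r - q), with -w < d < w
    return d if d >= 0 else d + (w + 1)

print(mul_mod_w_plus_1(15, 15, 16))   # 4, since 225 = 13*17 + 4
print(mul_mod_w_plus_1(4, 4, 16))     # 16, the value w itself
```

When w is a power of 2, `divmod` by w is just a shift and a mask. The second
call shows that the result can equal w, the case that program (2) signals by
turning overflow on.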

In later sections we shall require a knowledge of the prime factors of m in 
order to choose the multiplier a correctly. Table 1 lists the complete factorization 
of w ± 1 into primes for nearly every known computer word size; the methods of
Section 4.5.4 can be used to extend this table if desired. 

The reader may well ask why we bother to consider using m = w ± 1, when
the choice m = w is so manifestly convenient. The reason is that when m = w,
the right-hand digits of $X_n$ are much less random than the left-hand digits. If
d is a divisor of m, and if

    $$Y_n = X_n \bmod d, \tag{3}$$

we can easily show that

    $$Y_{n+1} = (aY_n + c) \bmod d. \tag{4}$$

(For $X_{n+1} = aX_n + c - qm$ for some integer q, and taking both sides mod d
causes the quantity qm to drop out when d is a factor of m.)

To illustrate the significance of Eq. (4), let us suppose, for example, that
we have a binary computer. If $m = w = 2^e$, the low-order four bits of $X_n$ are
the numbers $Y_n = X_n \bmod 2^4$. The gist of Eq. (4) is that the low-order four
bits of $\langle X_n \rangle$ form a congruential sequence that has a period of length 16 or less.
Similarly, the low-order five bits are periodic with a period of at most 32; and
the least significant bit of $X_n$ is either constant or strictly alternating.
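
A short experiment makes this weakness visible. The Python sketch below is an
illustration only; the helper name `lcg_values` and the parameter values are our
own assumptions and are not recommended by the text.

```python
def lcg_values(m, a, c, x0, count):
    """Return [X_0, ..., X_{count-1}] for X_{n+1} = (a*X_n + c) mod m."""
    values, x = [], x0
    for _ in range(count):
        values.append(x)
        x = (a * x + c) % m
    return values

# Assumed example: a 32-bit binary word, m = w = 2^32.
m = 1 << 32
a, c, x0 = 1664525, 1013904223, 2024

xs = lcg_values(m, a, c, x0, 64)
low4 = [x % 16 for x in xs]      # Y_n = X_n mod 2^4, the low-order four bits
low1 = [x % 2 for x in xs]       # the least significant bit

# Eq. (4): the low-order bits obey their own congruential rule modulo 16.
print(low4 == lcg_values(16, a % 16, c % 16, x0 % 16, 64))   # True
print(low4[:16], low4[16:32] == low4[:16])
print(low1[:8])
```

Whatever the high-order bits do, the four low-order bits here repeat after 16
steps, and the last bit simply alternates.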

This situation does not occur when m = w ± 1; in such a case, the low-order
bits of $X_n$ will behave just as randomly as the high-order bits do. If, for example,
$w = 2^{35}$ and $m = 2^{35} - 1$, the numbers of the sequence will not be very random if
we consider only their remainders mod 31, 71, 127, or 122921 (see Table 1); but
the low-order bit, which represents the numbers of the sequence taken mod 2,
should be satisfactorily random.

Another alternative is to let m be the largest prime number less than w. 
This prime may be found by using the techniques of Section 4.5.4, and a table 
of suitably large primes appears in that section. 
Table 1
PRIME FACTORIZATIONS OF w ± 1

2^e − 1                                     e    2^e + 1
7·31·151                                    15   3^2·11·331
3·5·17·257                                  16   65537
131071                                      17   3·43691
3^3·7·19·73                                 18   5·13·37·109
524287                                      19   3·174763
3·5^2·11·31·41                              20   17·61681
7^2·127·337                                 21   3^2·43·5419
3·23·89·683                                 22   5·397·2113
47·178481                                   23   3·2796203
3^2·5·7·13·17·241                           24   97·257·673
31·601·1801                                 25   3·11·251·4051
3·2731·8191                                 26   5·53·157·1613
7·73·262657                                 27   3^4·19·87211
3·5·29·43·113·127                           28   17·15790321
233·1103·2089                               29   3·59·3033169
3^2·7·11·31·151·331                         30   5^2·13·41·61·1321
2147483647                                  31   3·715827883
3·5·17·257·65537                            32   641·6700417
7·23·89·599479                              33   3^2·67·683·20857
3·43691·131071                              34   5·137·953·26317
31·71·127·122921                            35   3·11·43·281·86171
3^3·5·7·13·19·37·73·109                     36   17·241·433·38737
223·616318177                               37   3·1777·25781083
3·174763·524287                             38   5·229·457·525313
7·79·8191·121369                            39   3^2·2731·22366891
3·5^2·11·17·31·41·61681                     40   257·4278255361
13367·164511353                             41   3·83·8831418697
3^2·7^2·43·127·337·5419                     42   5·13·29·113·1429·14449
431·9719·2099863                            43   3·2932031007403
3·5·23·89·397·683·2113                      44   17·353·2931542417
7·31·73·151·631·23311                       45   3^3·11·19·331·18837001
3·47·178481·2796203                         46   5·277·1013·1657·30269
2351·4513·13264529                          47   3·283·165768537521
3^2·5·7·13·17·97·241·257·673                48   193·65537·22253377
179951·3203431780337                        59   3·2833·37171·1824726041
3^2·5^2·7·11·13·31·41·61·151·331·1321       60   17·241·61681·4562284561
7^2·73·127·337·92737·649657                 63   3^3·19·43·5419·77158673929
3·5·17·257·641·65537·6700417                64   274177·67280421310721

10^e − 1                                    e    10^e + 1
3^3·7·11·13·37                              6    101·9901
3^2·239·4649                                7    11·909091
3^2·11·73·101·137                           8    17·5882353
3^4·37·333667                               9    7·11·13·19·52579
3^2·11·41·271·9091                          10   101·3541·27961
3^2·21649·513239                            11   11^2·23·4093·8779
3^3·7·11·13·37·101·9901                     12   73·137·99990001
3^2·11·17·73·101·137·5882353                16   353·449·641·1409·69857

In most applications, the low-order bits are insignificant, and the choice
m = w is quite satisfactory — provided that the programmer using the random 
numbers does so wisely. 

Our discussion so far has been based on a “signed magnitude” computer like 
MIX. Similar ideas apply to machines that use complement notations, although 
there are some instructive variations. For example, a DECsystem 20 computer 
has 36 bits with two’s complement arithmetic; when it computes the product of 
two nonnegative integers, the lower half contains the least significant 35 bits with 
a plus sign. On this machine we should therefore take $w = 2^{35}$, not $2^{36}$. The
32-bit two’s complement arithmetic on IBM System/370 computers is different: 
The lower half of a product contains a full 32 bits. Some programmers have 
felt that this is a disadvantage, since the lower half can be negative when the 
operands are positive, and it is a nuisance to correct this; but actually it is a 
distinct advantage from the standpoint of random number generation, since we 
can take $m = 2^{32}$ instead of $2^{31}$ (see exercise 4).


EXERCISES 


1. [M12] In exercise 3.2.1-3 we concluded that the best congruential generators will 
have the multiplier a relatively prime to m. Show that when m = w in this case it is 
possible to compute (aX +c) mod w in just three MIX instructions, rather than the four 
in (1), with the result appearing in register X. 


2. [16] Write a MIX subroutine having the following characteristics: 
Calling sequence: JMP RANDM 
Entry conditions: Location XRAND contains an integer X. 
Exit conditions: X ← rA ← (aX + c) mod w, rX ← 0, overflow off.


(Thus a call on this subroutine will produce the next random number of a linear 
congruential sequence.) 


3. [M25] Many computers do not provide the ability to divide a two-word number 
by a one-word number; they provide only operations on single-word numbers, such as 
$\mathrm{himult}(x, y) = \lfloor xy/w \rfloor$ and $\mathrm{lomult}(x, y) = xy \bmod w$, when x and y are nonnegative
integers less than the word size w. Explain how to evaluate ax mod m in terms of
himult and lomult, assuming that $0 \le a, x < m < w$ and that $m \perp w$ (that is, m is
relatively prime to w). You may use
precomputed constants that depend on a, m, and w. 


4. [21] Discuss the calculation of linear congruential sequences with $m = 2^{32}$ on
two’s-complement machines such as the System/370 series. 


5. [20] Given that m is less than the word size, and that x and y are nonnegative 
integers less than m, show that the difference (x — y) mod m may be computed in just 
four MIX instructions, without requiring any division. What is the best code for the 
sum (x + y) mod m?


6. [20] The previous exercise suggests that subtraction mod m is easier to perform 
than addition mod m. Discuss sequences generated by the rule 


Xn+1 = (aXn — c) mod m. 


Are these sequences essentially different from linear congruential sequences as defined 
in the text? Are they more suited to efficient computer calculation? 
7. [M24] What patterns can you spot in Table 1?


8. [20] Write a MIX program analogous to (2) that computes (aX) mod (w—1). The 
values 0 and w — 1 are to be treated as equivalent in the input and output of your 
program. 


9. [M25] Most high-level programming languages do not provide a good way to 
divide a two-word integer by a one-word integer, nor do they provide the himult 
operation of exercise 3. The purpose of this exercise is to find a reasonable way to 
cope with such limitations when we wish to evaluate ax mod m for variable x and for 
constants 0 < a < m. 


a) Prove that if $q = \lfloor m/a \rfloor$, we have $a(x - (x \bmod q)) = \lfloor x/q \rfloor (m - (m \bmod a))$.
b) Use the identity of (a) to evaluate ax mod m without computing any numbers that
exceed m in absolute value, assuming that $a^2 \le m$.


10. [M26] The solution to exercise 9(b) sometimes works also when $a^2 > m$. Exactly
how many multipliers a are there for which the intermediate results in that method 
never exceed m, for all x between 0 and m? 


11. [M30] Continuing exercise 9, show that it is possible to evaluate ax mod m using
only the following basic operations:

i) $u \times v$, where $u \ge 0$, $v \ge 0$, and $uv < m$;

ii) $\lfloor u/v \rfloor$, where $0 < v \le u < m$;

iii) $(u - v) \bmod m$, where $0 \le u, v < m$.


In fact, it is always possible to do this with at most 12 operations of types (i) and (ii),
and with a bounded number of operations of type (iii), not counting the precomputation
of constants that depend on a and m. For example, explain how to proceed when a is
62089911 and m is $2^{31} - 1$. (These constants appear in Table 3.3.4-1.)


12. [M28] Consider computations by pencil and paper or an abacus. 

a) What’s a good way to multiply a given 10-digit number by 10, modulo 9999998999? 

b) Same question, but multiply instead by 999999900 (modulo 9999998999). 

c) Explain how to compute the powers $999999900^n \bmod 9999998999$, for n = 1, 2,
3, ....

d) Relate such computations to the decimal expansion of 1/9999998999. 

e) Show that these ideas make it possible to implement certain kinds of linear con- 
gruential generators that have extremely large moduli, using only a few operations 
per generated number. 


13. [M24] Repeat the previous exercise, but with modulus 9999999001 and with 
multipliers 10 and 8999999101. 


14. [M25] Generalize the ideas of the previous two exercises, obtaining a large family 
of linear congruential generators with extremely large moduli. 


3.2.1.2. Choice of multiplier. In this section we shall consider how to choose 
the multiplier a so as to produce a period of maximum length. A long period 
is essential for any sequence that is to be used as a source of random numbers; 
indeed, we would hope that the period contains considerably more numbers than 
will ever be used in a single application. Therefore we shall concern ourselves in 
this section with the question of period length. The reader should keep in mind, 
however, that a long period is only one desirable criterion for the randomness of 
a linear congruential sequence. For example, when a = c = 1, the sequence is
simply $X_{n+1} = (X_n + 1) \bmod m$, and this obviously has a period of length m,
yet it is anything but random. Other considerations affecting the choice of a 
multiplier will be given later in this chapter. 

Since only m different values are possible, the period surely cannot be longer 
than m. Can we achieve the maximum length, m? The example above shows that 
it is always possible, although the choice a = c = 1 does not yield a desirable 
sequence. Let us investigate all possible choices of a, c, and Xo that give a 
period of length m. It turns out that all such values of the parameters can be 
characterized very simply; when m is the product of distinct primes, only a = 1 
will produce the full period, but when m is divisible by a high power of some 
prime there is considerable latitude in the choice of a. The following theorem 
makes it easy to tell if the maximum period is achieved. 

Theorem A. The linear congruential sequence defined by m, a, c, and Xo has 
period length m if and only if 

i) c is relatively prime to m; 

ii) b = a − 1 is a multiple of p, for every prime p dividing m;
iii) b is a multiple of 4, if m is a multiple of 4. 
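
For small moduli the theorem can be confirmed by exhaustive search. The Python
sketch below is an illustration under our own assumptions (the function names
are hypothetical, and trial division is adequate only for small m); it compares
conditions (i), (ii), and (iii) with the period actually observed.

```python
from math import gcd

def prime_factors(n):
    """Set of primes dividing n (trial division; fine for small n)."""
    primes, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            primes.add(p)
            n //= p
        p += 1
    if n > 1:
        primes.add(n)
    return primes

def satisfies_theorem_a(m, a, c):
    """Conditions (i)-(iii) of Theorem A, with b = a - 1."""
    b = a - 1
    return (gcd(c, m) == 1
            and all(b % p == 0 for p in prime_factors(m))
            and (m % 4 != 0 or b % 4 == 0))

def period_by_brute_force(m, a, c, x0=0):
    """Length of the cycle eventually entered by X_{n+1} = (a*X_n + c) mod m."""
    seen, x, n = {}, x0, 0
    while x not in seen:
        seen[x] = n
        x = (a * x + c) % m
        n += 1
    return n - seen[x]

# Exhaustive comparison for one small modulus (m = 36 = 2^2 * 3^2).
m = 36
full = [(a, c) for a in range(m) for c in range(m) if satisfies_theorem_a(m, a, c)]
assert all(period_by_brute_force(m, a, c) == m for a, c in full)
print(sorted({a for a, _ in full}))   # [1, 13, 25]
```

For m = 36 the multipliers must have b divisible by 2, 3, and 4, that is, by 12,
which leaves a = 1, 13, 25; the example also shows the "considerable latitude"
that appears only when m has repeated prime factors.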

The ideas used in the proof of this theorem go back at least a hundred 
years. But the first proof of the theorem in this particular form was given by 
M. Greenberger in the special case $m = 2^e$ [see JACM 8 (1961), 163-167], and
the sufficiency of conditions (i), (ii), and (iii) in the general case was shown by 
Hull and Dobell [see SIAM Review 4 (1962), 230-254]. To prove the theorem 
we will first consider some auxiliary number-theoretic results that are of interest 
in themselves. 

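Lemma P. Let p be a prime number, and let e be a positive integer, where
$p^e > 2$. If

    $$x \equiv 1 \ (\text{modulo } p^e), \qquad x \not\equiv 1 \ (\text{modulo } p^{e+1}), \tag{1}$$

then

    $$x^p \equiv 1 \ (\text{modulo } p^{e+1}), \qquad x^p \not\equiv 1 \ (\text{modulo } p^{e+2}). \tag{2}$$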


Proof. We have $x = 1 + qp^e$ for some integer q that is not a multiple of p. By
the binomial formula

    $$x^p = 1 + \binom{p}{1} q p^e + \cdots + \binom{p}{p-1} q^{p-1} p^{(p-1)e} + q^p p^{pe}$$

    $$\phantom{x^p} = 1 + q p^{e+1} \left( 1 + \frac{1}{p}\binom{p}{2} q p^e + \frac{1}{p}\binom{p}{3} q^2 p^{2e} + \cdots + \frac{1}{p}\binom{p}{p} q^{p-1} p^{(p-1)e} \right).$$

The quantity in parentheses is an integer, and, in fact, every term inside the
parentheses is a multiple of p except the first term. For if $1 < k < p$, the
binomial coefficient $\binom{p}{k}$ is divisible by p (see exercise 1.2.6-10); hence

    $$\frac{1}{p}\binom{p}{k} q^{k-1} p^{(k-1)e}$$

is divisible by $p^{(k-1)e}$. And the last term is $q^{p-1} p^{(p-1)e-1}$, which is divisible by p
since $(p - 1)e > 1$ when $p^e > 2$. So $x^p \equiv 1 + qp^{e+1}$ (modulo $p^{e+2}$), and this
completes the proof. (Note: A generalization of this result appears in exercise
3.2.2-11(a).) ▮
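
Lemma P can be spot-checked with exact integer arithmetic. The following Python
sketch is an illustration only; the function name `lemma_p_conclusion` is
hypothetical. It also shows why the hypothesis $p^e > 2$ is needed.

```python
def lemma_p_conclusion(p, e, q):
    """For x = 1 + q*p^e (p prime, p not dividing q), test conclusion (2):
    x^p is 1 modulo p^(e+1) but not 1 modulo p^(e+2)."""
    x = 1 + q * p**e
    xp = x**p
    return (xp - 1) % p**(e + 1) == 0 and (xp - 1) % p**(e + 2) != 0

# Cases allowed by the lemma (p^e > 2).
for p in (2, 3, 5, 7):
    for e in range(1, 5):
        if p**e <= 2:
            continue
        for q in range(1, 30):
            if q % p != 0:
                assert lemma_p_conclusion(p, e, q)

# The excluded case p = 2, e = 1: x = 3 gives x^2 = 9, which is 1 modulo 8.
print(lemma_p_conclusion(2, 1, 1))   # False
```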


Lemma Q. Let the decomposition of m into prime factors be

    $$m = p_1^{e_1} \ldots p_t^{e_t}. \tag{3}$$

The length $\lambda$ of the period of the linear congruential sequence determined by
$(X_0, a, c, m)$ is the least common multiple of the lengths $\lambda_j$ of the periods of the
linear congruential sequences $(X_0 \bmod p_j^{e_j},\ a \bmod p_j^{e_j},\ c \bmod p_j^{e_j},\ p_j^{e_j})$, $1 \le j \le t$.

Proof. By induction on t, it suffices to prove that if $m_1$ and $m_2$ are relatively
prime, the length $\lambda$ of the period of the linear congruential sequence determined
by the parameters $(X_0, a, c, m_1 m_2)$ is the least common multiple of the lengths
$\lambda_1$ and $\lambda_2$ of the periods of the sequences determined by $(X_0 \bmod m_1, a \bmod m_1,
c \bmod m_1, m_1)$ and $(X_0 \bmod m_2, a \bmod m_2, c \bmod m_2, m_2)$. We observed in the
previous section, Eq. (4), that if the elements of these three sequences are
respectively denoted by $X_n$, $Y_n$, and $Z_n$, we will have

    $$Y_n = X_n \bmod m_1 \quad\text{and}\quad Z_n = X_n \bmod m_2, \qquad \text{for all } n \ge 0.$$

Therefore, by Law D of Section 1.2.4, we find that

    $$X_n = X_k \quad\text{if and only if}\quad Y_n = Y_k \ \text{and}\ Z_n = Z_k. \tag{4}$$

Let $\lambda'$ be the least common multiple of $\lambda_1$ and $\lambda_2$; we wish to prove that
$\lambda' = \lambda$. Since $X_n = X_{n+\lambda}$ for all suitably large n, we have $Y_n = Y_{n+\lambda}$ (hence
$\lambda$ is a multiple of $\lambda_1$) and $Z_n = Z_{n+\lambda}$ (hence $\lambda$ is a multiple of $\lambda_2$), so we must
have $\lambda \ge \lambda'$. Furthermore, we know that $Y_n = Y_{n+\lambda'}$ and $Z_n = Z_{n+\lambda'}$ for all
suitably large n; therefore, by (4), $X_n = X_{n+\lambda'}$. This proves $\lambda \le \lambda'$. ▮
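
The statement of Lemma Q for two relatively prime factors can be verified
experimentally. The Python sketch below is an illustration; the function names
`period`, `lcm`, and `check_lemma_q` are hypothetical, and the brute-force period
computation is suitable only for small moduli.

```python
from math import gcd

def period(m, a, c, x0):
    """Period length of X_{n+1} = (a*X_n + c) mod m, starting from x0 mod m."""
    seen, x, n = {}, x0 % m, 0
    while x not in seen:
        seen[x] = n
        x = (a * x + c) % m
        n += 1
    return n - seen[x]

def lcm(u, v):
    return u * v // gcd(u, v)

def check_lemma_q(m1, m2, a, c, x0):
    """Compare the period modulo m1*m2 with lcm of the periods modulo m1 and m2."""
    assert gcd(m1, m2) == 1
    return period(m1 * m2, a, c, x0) == lcm(period(m1, a, c, x0),
                                            period(m2, a, c, x0))

# Every choice of a and c for m = 8 * 9 = 72, with a fixed starting value.
print(all(check_lemma_q(8, 9, a, c, 5) for a in range(72) for c in range(72)))
```

Reducing a, c, and $X_0$ modulo $m_1$ or $m_2$ first would not change the result,
since only their residues enter the recurrence.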


Now we are ready to prove Theorem A. Lemma Q tells us that it suffices to 
prove the theorem when m is a power of a prime number, because 


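$$p_1^{e_1} \cdots p_t^{e_t} = \lambda = \operatorname{lcm}(\lambda_1, \ldots, \lambda_t) \le \lambda_1 \cdots \lambda_t \le p_1^{e_1} \cdots p_t^{e_t}$$

will be true if and only if $\lambda_j = p_j^{e_j}$ for $1 \le j \le t$.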

Assume therefore that $m = p^e$, where p is prime and e is a positive integer. The theorem is obviously true when a = 1, so we may take a > 1. The period can be of length m if and only if each possible integer $0 \le x < m$ occurs in the period, since no value occurs in the period more than once. Therefore the period is of length m if and only if the period of the sequence with $X_0 = 0$ is of length m, and we are justified in supposing that $X_0 = 0$. By formula 3.2.1-(6) we have

$$X_n = \left(\frac{a^n - 1}{a - 1}\right) c \bmod m. \tag{5}$$

If c is not relatively prime to m, this value $X_n$ could never be equal to 1, so condition (i) of the theorem is necessary. The period has length m if and only if the smallest positive value of n for which $X_n = X_0 = 0$ is n = m. By (5) and condition (i), our theorem now reduces to proving the following fact:


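Lemma R. Assume that $1 < a < p^e$, where p is prime. If $\lambda$ is the smallest positive integer for which $(a^\lambda - 1)/(a - 1) \equiv 0$ (modulo $p^e$), then

$$\lambda = p^e \quad\text{if and only if}\quad \begin{cases} a \equiv 1 \ (\text{modulo } p) & \text{when } p > 2,\\ a \equiv 1 \ (\text{modulo } 4) & \text{when } p = 2. \end{cases}$$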


Proof. Assume that $\lambda = p^e$. If $a \not\equiv 1$ (modulo p), then $(a^n - 1)/(a - 1) \equiv 0$ (modulo $p^e$) if and only if $a^n - 1 \equiv 0$ (modulo $p^e$). The condition $a^{p^e} - 1 \equiv 0$ (modulo $p^e$) then implies that $a^{p^e} \equiv 1$ (modulo p); but by Theorem 1.2.4F we have $a^{p^e} \equiv a$ (modulo p), hence $a \not\equiv 1$ (modulo p) leads to a contradiction. And if p = 2 and $a \equiv 3$ (modulo 4), we have

$$(a^{2^{e-1}} - 1)/(a - 1) \equiv 0 \quad (\text{modulo } 2^e)$$

by exercise 8. These arguments show that it is necessary in general to have $a = 1 + qp^f$, where $p^f > 2$ and q is not a multiple of p, whenever $\lambda = p^e$.

It remains to be shown that this condition is sufficient to make $\lambda = p^e$. By repeated application of Lemma P, we find that

$$a^{p^g} \equiv 1 \ (\text{modulo } p^{f+g}), \qquad a^{p^g} \not\equiv 1 \ (\text{modulo } p^{f+g+1}),$$

for all $g \ge 0$, and therefore

$$\begin{aligned} (a^{p^g} - 1)/(a - 1) &\equiv 0 \ (\text{modulo } p^g),\\ (a^{p^g} - 1)/(a - 1) &\not\equiv 0 \ (\text{modulo } p^{g+1}). \end{aligned} \tag{6}$$

In particular, $(a^{p^e} - 1)/(a - 1) \equiv 0$ (modulo $p^e$). Now the congruential sequence $(0, a, 1, p^e)$ has $X_n = (a^n - 1)/(a - 1) \bmod p^e$; therefore it has a period of length $\lambda$, that is, $X_n = 0$ if and only if n is a multiple of $\lambda$. Hence $p^e$ is a multiple of $\lambda$. This can happen only if $\lambda = p^g$ for some g, and the relations in (6) imply that $\lambda = p^e$, completing the proof. ∎


The proof of Theorem A is now complete. ∎
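For small moduli the theorem can be confirmed by exhaustive search. The sketch below is only an illustration: it assumes the three conditions of Theorem A in the form used in this section (c relatively prime to m; b = a - 1 a multiple of every prime dividing m; b a multiple of 4 when m is a multiple of 4), and the function names are hypothetical.

```python
from math import gcd

def has_full_period(a, c, m):
    """Brute force: starting from X_0 = 0, is the first return to 0 at n = m?"""
    x = 0
    for n in range(1, m + 1):
        x = (a * x + c) % m
        if x == 0:
            return n == m
    return False                   # 0 never recurred within m steps

def prime_divisors(m):
    """Distinct prime divisors of m >= 2, by trial division."""
    primes, d = [], 2
    while d * d <= m:
        if m % d == 0:
            primes.append(d)
            while m % d == 0:
                m //= d
        d += 1
    if m > 1:
        primes.append(m)
    return primes

def satisfies_theorem_a(a, c, m):
    """The three conditions of Theorem A, as restated in the lead-in above."""
    b = a - 1
    return (gcd(c, m) == 1
            and all(b % p == 0 for p in prime_divisors(m))
            and (m % 4 != 0 or b % 4 == 0))

# Compare the criterion with brute force for every (a, c) and a few moduli.
for m in (8, 9, 12, 36):
    assert all(has_full_period(a, c, m) == satisfies_theorem_a(a, c, m)
               for a in range(m) for c in range(m))
```

The search relies on the observation made in the proof: the period has length m exactly when the sequence started at 0 first returns to 0 after m steps.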


We will conclude this section by considering the special case of pure mul- 
tiplicative generators, when c = 0. Although the random number generation 
process is slightly faster in this case, Theorem A shows us that the maximum 
period length cannot be achieved. In fact, this is quite obvious, since the sequence 
now satisfies the relation 

$$X_{n+1} = aX_n \bmod m, \tag{7}$$

and the value $X_n = 0$ should never appear, lest the sequence degenerate to zero. In general, if d is any divisor of m and if $X_n$ is a multiple of d, all succeeding elements $X_{n+1}$, $X_{n+2}$, ... of the multiplicative sequence will be multiples of d. So when c = 0, we will want $X_n$ to be relatively prime to m for all n, and this limits the length of the period to at most $\varphi(m)$, the number of integers between 0 and m that are relatively prime to m.

It may be possible to achieve an acceptably long period even if we stipulate
that c = 0. Let us now try to find conditions on the multiplier so that the period 
is as long as possible in this special case. 

According to Lemma Q, the period of the sequence depends entirely on the periods of the sequences when $m = p^e$, so let us consider that situation. We have $X_n = a^n X_0 \bmod p^e$, and it is clear that the period will be of length 1 if a is a multiple of p, so we take a to be relatively prime to p. Then the period is the smallest integer $\lambda$ such that $X_0 = a^\lambda X_0 \bmod p^e$. If the greatest common divisor of $X_0$ and $p^e$ is $p^f$, this condition is equivalent to

$$a^\lambda \equiv 1 \quad (\text{modulo } p^{e-f}). \tag{8}$$

By Euler's theorem (exercise 1.2.4-28), $a^{\varphi(p^{e-f})} \equiv 1$ (modulo $p^{e-f}$); hence $\lambda$ is a divisor of

$$\varphi(p^{e-f}) = p^{e-f-1}(p - 1).$$

When a is relatively prime to m, the smallest integer $\lambda$ for which $a^\lambda \equiv 1$ (modulo m) is conventionally called the order of a modulo m. Any such value of a that has the maximum possible order modulo m is called a primitive element modulo m.

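Let $\lambda(m)$ denote the order of a primitive element, namely the maximum possible order, modulo m. The remarks above show that $\lambda(p^e)$ is a divisor of $p^{e-1}(p-1)$; with a little care (see exercises 11 through 16 below) we can give the precise value of $\lambda(m)$ in all cases as follows:

$$\begin{aligned}
&\lambda(2) = 1, \quad \lambda(4) = 2, \quad \lambda(2^e) = 2^{e-2} \ \text{ if } e \ge 3;\\
&\lambda(p^e) = p^{e-1}(p-1), \ \text{ if } p > 2;\\
&\lambda(p_1^{e_1} \cdots p_t^{e_t}) = \operatorname{lcm}\bigl(\lambda(p_1^{e_1}), \ldots, \lambda(p_t^{e_t})\bigr).
\end{aligned} \tag{9}$$

Our remarks may be summarized in the following theorem:
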
Theorem B. [C. F. Gauss, Disquisitiones Arithmeticæ (1801), §90-92.] The maximum period possible when c = 0 is $\lambda(m)$, where $\lambda(m)$ is defined in (9). This period is achieved if
  i) $X_0$ is relatively prime to m;
  ii) a is a primitive element modulo m. ∎

Notice that we can obtain a period of length m - 1 if m is prime; this is just one less than the maximum length, so for all practical purposes such a period is as long as we want.
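The values in (9) can be compared with the largest order found by direct search. This is a small sketch under stated assumptions: the names `factorize`, `lam` and `order` are hypothetical, and the brute-force comparison is practical only for small m.

```python
from math import gcd

def factorize(m):
    """Return {prime: exponent} for m >= 2, by trial division."""
    factors, d = {}, 2
    while d * d <= m:
        while m % d == 0:
            factors[d] = factors.get(d, 0) + 1
            m //= d
        d += 1
    if m > 1:
        factors[m] = factors.get(m, 0) + 1
    return factors

def lam(m):
    """lambda(m) computed from the three rules in (9)."""
    result = 1
    for p, e in factorize(m).items():
        if p == 2:
            part = 1 if e == 1 else 2 if e == 2 else 2 ** (e - 2)
        else:
            part = p ** (e - 1) * (p - 1)
        result = result * part // gcd(result, part)   # running lcm
    return result

def order(a, m):
    """Smallest n > 0 with a^n = 1 (modulo m); a must be prime to m."""
    if gcd(a, m) != 1:
        raise ValueError("a must be relatively prime to m")
    x, n = a % m, 1
    while x != 1:
        x = x * a % m
        n += 1
    return n

# Maximum order modulo m found by exhaustive search, compared with (9).
for m in (16, 27, 45, 100):
    best = max(order(a, m) for a in range(1, m) if gcd(a, m) == 1)
    assert best == lam(m)
```

For a prime modulus the first rule for odd p gives `lam(m) == m - 1`, which is the period length mentioned just after Theorem B.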
The question now is, how can we find primitive elements modulo m? The 
exercises at the close of this section tell us that there is a fairly simple answer 


when m is prime or a power of a prime, namely the results stated in our next 
theorem. 


Theorem C. The number a is a primitive element modulo $p^e$ if and only if one of the following cases applies:

  i) p = 2, e = 1, and a is odd;
  ii) p = 2, e = 2, and a mod 4 = 3;
  iii) p = 2, e = 3, and a mod 8 = 3, 5, or 7;
  iv) p = 2, $e \ge 4$, and a mod 8 = 3 or 5;
  v) p is odd, e = 1, $a \not\equiv 0$ (modulo p), and $a^{(p-1)/q} \not\equiv 1$ (modulo p) for any prime divisor q of p - 1;
  vi) p is odd, e > 1, a satisfies the conditions of (v), and $a^{p-1} \not\equiv 1$ (modulo $p^2$). ∎


Conditions (v) and (vi) of this theorem are readily tested on a computer for 
large values of p, by using the efficient methods for evaluating powers discussed 
in Section 4.6.3, if we know the factors of p — 1. 
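The six cases translate directly into a short test. The sketch below is illustrative only: the function name and the convention of passing the prime divisors of p - 1 as `q_list` are assumptions, and Python's built-in three-argument `pow` stands in for the power-evaluation methods of Section 4.6.3.

```python
def is_primitive_mod_prime_power(a, p, e, q_list):
    """Theorem C for the modulus p^e; q_list holds the prime divisors of p - 1
    (it is ignored when p = 2)."""
    if p == 2:
        if e == 1:
            return a % 2 == 1            # case (i)
        if e == 2:
            return a % 4 == 3            # case (ii)
        if e == 3:
            return a % 8 in (3, 5, 7)    # case (iii)
        return a % 8 in (3, 5)           # case (iv)
    if a % p == 0:
        return False
    # Case (v): a^((p-1)/q) must differ from 1 modulo p for every prime q.
    if any(pow(a, (p - 1) // q, p) == 1 for q in q_list):
        return False
    # Case (vi): for e > 1 we also need a^(p-1) != 1 modulo p^2.
    return e == 1 or pow(a, p - 1, p * p) != 1

def order(a, m):
    """Brute-force order of a modulo m (a prime to m), used only as a check."""
    x, n = a % m, 1
    while x != 1:
        x = x * a % m
        n += 1
    return n

# Check against the maximum order p^(e-1) (p - 1) for p = 7, e = 2.
p, e, qs = 7, 2, [2, 3]
for a in range(1, p ** e):
    if a % p:
        expected = order(a, p ** e) == p ** (e - 1) * (p - 1)
        assert is_primitive_mod_prime_power(a, p, e, qs) == expected
```

Each call costs only a few modular exponentiations, which is why the test remains feasible for large p once p - 1 has been factored.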

Theorem C applies to powers of primes only. But if we are given values $a_j$ that are primitive modulo $p_j^{e_j}$, it is possible to find a single value a such that $a \equiv a_j$ (modulo $p_j^{e_j}$), for $1 \le j \le t$, using the Chinese remainder algorithm discussed in Section 4.3.2; this number a will be a primitive element modulo $p_1^{e_1} \cdots p_t^{e_t}$. Hence there is a reasonably efficient way to construct multipliers satisfying the condition of Theorem B, for any modulus m of moderate size, although the calculations can be somewhat lengthy in the general case.
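As a small worked instance of this construction, take m = 7^2 · 11 = 539 (an arbitrary choice for illustration). The value 3 is primitive modulo 49 and 2 is primitive modulo 11, so combining them should give an element of order lcm(42, 10) = 210 modulo 539. The helper names below are hypothetical, and the two-modulus formula is a simple stand-in for the Chinese remainder algorithm of Section 4.3.2; `pow(x, -1, n)` requires Python 3.8 or later.

```python
def crt_pair(r1, m1, r2, m2):
    """The unique a in [0, m1*m2) with a = r1 (mod m1) and a = r2 (mod m2),
    assuming m1 and m2 are relatively prime."""
    inv = pow(m1, -1, m2)                      # inverse of m1 modulo m2
    return (r1 + m1 * ((r2 - r1) * inv % m2)) % (m1 * m2)

def order(a, m):
    """Brute-force order of a modulo m; a must be prime to m."""
    x, n = a % m, 1
    while x != 1:
        x = x * a % m
        n += 1
    return n

a = crt_pair(3, 49, 2, 11)     # a = 101
print(a, a % 49, a % 11, order(a, 539))
```

The last number printed is the order of a modulo 539, which should equal the maximum value 210 given by (9).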

In the common case $m = 2^e$, with $e \ge 4$, the conditions above simplify to the single requirement that $a \equiv 3$ or 5 (modulo 8). In this case, one-fourth of all possible multipliers will make the period length equal to m/4, and m/4 is the maximum possible when c = 0.

The second most common case is when $m = 10^e$. Using Lemmas P and Q, it is not difficult to obtain necessary and sufficient conditions for the achievement of the maximum period in the case of a decimal computer (see exercise 18):


Theorem D. If $m = 10^e$, $e \ge 5$, c = 0, and $X_0$ is not a multiple of 2 or 5, the period of the linear congruential sequence is $5 \times 10^{e-2}$ if and only if a mod 200 equals one of the following 32 values:

    3, 11, 13, 19, 21, 27, 29, 37, 53, 59, 61, 67, 69, 77, 83, 91, 109, 117,
    123, 131, 133, 139, 141, 147, 163, 171, 173, 179, 181, 187, 189, 197.      (10)
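The list in (10) can be checked mechanically for a particular e. The following sketch is an illustration rather than a proof: it assumes that the maximum order modulo 10^e is 5 × 10^(e-2), as (9) gives for e ≥ 5, so an element has that order exactly when neither of the two proper prime quotients of the exponent already yields 1. The names are hypothetical.

```python
THEOREM_D_VALUES = {
    3, 11, 13, 19, 21, 27, 29, 37, 53, 59, 61, 67, 69, 77, 83, 91, 109, 117,
    123, 131, 133, 139, 141, 147, 163, 171, 173, 179, 181, 187, 189, 197,
}

def has_maximum_order(a, e):
    """True if a has order 5 * 10^(e-2) modulo 10^e (intended for e >= 5)."""
    m = 10 ** e
    top = 5 * 10 ** (e - 2)
    return (pow(a, top, m) == 1             # fails when a is not prime to 10
            and pow(a, top // 2, m) != 1    # order does not divide top/2
            and pow(a, top // 5, m) != 1)   # order does not divide top/5

# Compare with the table for e = 5 over a range of multipliers.
for a in range(1, 2000):
    assert has_maximum_order(a, 5) == (a % 200 in THEOREM_D_VALUES)
```

Only the prime divisors 2 and 5 of the maximum order need to be tried, which keeps the check to three modular exponentiations per multiplier.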


EXERCISES 

1. [10] What is the length of the period of the linear congruential sequence with $X_0 = 5772156648$, a = 3141592621, c = 2718281829, and m = 10000000000?

2. [10] Are the following two conditions sufficient to guarantee the maximum length period, when m is a power of 2? "(i) c is odd; (ii) a mod 4 = 1."

3. [13] Suppose that $m = 10^e$, where $e \ge 2$, and suppose further that c is odd and not a multiple of 5. Show that the linear congruential sequence will have the maximum length period if and only if a mod 20 = 1.

4. [M20] Assume that $m = 2^e$ and $X_0 = 0$. If the numbers a and c satisfy the conditions of Theorem A, what is the value of $X_{2^{e-1}}$?

5. [14] Find all multipliers a that satisfy the conditions of Theorem A when $m = 2^{35} + 1$. (The prime factors of m may be found in Table 3.2.1.1-1.)

▶ 6. [20] Find all multipliers a that satisfy the conditions of Theorem A when $m = 10^6 - 1$. (See Table 3.2.1.1-1.)

▶ 7. [M23] The period of a congruential sequence need not start with $X_0$, but we can always find indices $\mu \ge 0$ and $\lambda > 0$ such that $X_{n+\lambda} = X_n$ whenever $n \ge \mu$, and for which $\mu$ and $\lambda$ are the smallest possible values with this property. (See exercises 3.1-6 and 3.2.1-1.) If $\mu_j$ and $\lambda_j$ are the indices corresponding to the sequences

$$(X_0 \bmod p_j^{e_j},\ a \bmod p_j^{e_j},\ c \bmod p_j^{e_j},\ p_j^{e_j}),$$

and if $\mu$ and $\lambda$ correspond to the composite sequence $(X_0, a, c, p_1^{e_1} \cdots p_t^{e_t})$, Lemma Q states that $\lambda$ is the least common multiple of $\lambda_1, \ldots, \lambda_t$. What is the value of $\mu$ in terms of the values of $\mu_1, \ldots, \mu_t$? What is the maximum possible value of $\mu$ obtainable by varying $X_0$, a, and c, when $m = p_1^{e_1} \cdots p_t^{e_t}$ is fixed?

8. [M20] Show that if a mod 4 = 3, we have $(a^{2^{e-1}} - 1)/(a - 1) \equiv 0$ (modulo $2^e$) when e > 1. (Use Lemma P.)

▶ 9. [M22] (W. E. Thomson.) When c = 0 and $m = 2^e \ge 16$, Theorems B and C say that the period has length $2^{e-2}$ if and only if the multiplier a satisfies a mod 8 = 3 or a mod 8 = 5. Show that every such sequence is essentially a linear congruential sequence with $m = 2^{e-2}$, having full period, in the following sense:

  a) If $X_{n+1} = (4c + 1)X_n \bmod 2^e$, and $X_n = 4Y_n + 1$, then

     $$Y_{n+1} = \bigl((4c + 1)Y_n + c\bigr) \bmod 2^{e-2}.$$

  b) If $X_{n+1} = (4c - 1)X_n \bmod 2^e$, and $X_n = \bigl((-1)^n (4Y_n + 1)\bigr) \bmod 2^e$, then

     $$Y_{n+1} = \bigl((1 - 4c)Y_n - c\bigr) \bmod 2^{e-2}.$$

[Note: In these formulas, c is an odd integer. The literature contains several statements to the effect that sequences with c = 0 satisfying Theorem B are somehow more random than sequences satisfying Theorem A, in spite of the fact that the period is only one-fourth as long in the case of Theorem B. This exercise refutes such statements; in essence, we must give up two bits of the word length in order to save the addition of c, when m is a power of 2.]


10. [M21] For what values of m is $\lambda(m) = \varphi(m)$?

▶ 11. [M28] Let x be an odd integer greater than 1. (a) Show that there exists a unique integer f > 1 such that $x \equiv 2^f \pm 1$ (modulo $2^{f+1}$). (b) Given that $1 < x < 2^e - 1$ and that f is the corresponding integer from part (a), show that the order of x modulo $2^e$ is $2^{e-f}$. (c) In particular, this proves parts (i)-(iv) of Theorem C.

12. [M26] Let p be an odd prime. If e > 1, prove that a is a primitive element modulo $p^e$ if and only if a is a primitive element modulo p and $a^{p-1} \not\equiv 1$ (modulo $p^2$). (For the purposes of this exercise, assume that $\lambda(p^e) = p^{e-1}(p-1)$. This fact is proved in exercises 14 and 16 below.)

13. [M22] Let p be prime. Given that a is not a primitive element modulo p, show that either a is a multiple of p or $a^{(p-1)/q} \equiv 1$ (modulo p) for some prime number q that divides p - 1.


14. [M18] If e > 1 and p is an odd prime, and if a is a primitive element modulo p, prove that either a or a + p is a primitive element modulo $p^e$. [Hint: See exercise 12.]

15. [M29] (a) Let $a_1$ and $a_2$ be relatively prime to m, and let their orders modulo m be $\lambda_1$ and $\lambda_2$, respectively. If $\lambda$ is the least common multiple of $\lambda_1$ and $\lambda_2$, prove that $a_1^{\kappa_1} a_2^{\kappa_2}$ has order $\lambda$ modulo m, for suitable integers $\kappa_1$ and $\kappa_2$. [Hint: Consider first the case that $\lambda_1$ is relatively prime to $\lambda_2$.] (b) Let $\lambda(m)$ be the maximum order of any element modulo m. Prove that $\lambda(m)$ is a multiple of the order of each element modulo m; that is, prove that $a^{\lambda(m)} \equiv 1$ (modulo m) whenever a is relatively prime to m. (Do not use Theorem B.)

16. [M24] (Existence of primitive roots.) Let p be a prime number.

  a) Consider the polynomial $f(x) = x^n + c_1 x^{n-1} + \cdots + c_n$, where the c's are integers. Given that a is an integer for which $f(a) \equiv 0$ (modulo p), show that there exists a polynomial

     $$q(x) = x^{n-1} + q_1 x^{n-2} + \cdots + q_{n-1}$$

     with integer coefficients such that $f(x) \equiv (x - a)q(x)$ (modulo p) for all integers x.
  b) Let f(x) be a polynomial as in (a). Show that f(x) has at most n distinct "roots" modulo p; that is, there are at most n integers a, with $0 \le a < p$, such that $f(a) \equiv 0$ (modulo p).
  c) Because of exercise 15(b), the polynomial $f(x) = x^{\lambda(p)} - 1$ has p - 1 distinct roots; hence there is an integer a with order p - 1.


17. [M26] Not all of the values listed in Theorem D would be found by the text's construction; for example, 11 is not primitive modulo $5^e$. How can this be possible, when 11 is primitive modulo $10^e$, according to Theorem D? Which of the values listed in Theorem D are primitive elements modulo both $2^e$ and $5^e$?


18. [M25] Prove Theorem D. (See the previous exercise.) 


19. [40] Make a table of some suitable multipliers, a, for each of the values of m listed 
in Table 3.2.1.1-1, assuming that c = 0. 


20. [M24] (G. Marsaglia.) The purpose of this exercise is to study the period length of an arbitrary linear congruential sequence. Let $Y_n = 1 + a + \cdots + a^{n-1}$, so that $X_n = (AY_n + X_0) \bmod m$ for some constant A by Eq. 3.2.1-(8).
  a) Prove that the period length of $\langle X_n \rangle$ is the period length of $\langle Y_n \bmod m' \rangle$, where $m' = m/\gcd(A, m)$.
  b) Prove that the period length of $\langle Y_n \bmod p^e \rangle$ satisfies the following when p is prime: (i) If a mod p = 0, it is 1. (ii) If a mod p = 1, it is $p^e$, except when p = 2 and $e \ge 2$ and a mod 4 = 3. (iii) If p = 2, $e \ge 2$, and a mod 4 = 3, it is twice the order of a modulo $p^e$ (see exercise 11), unless $a \equiv -1$ (modulo $2^e$) when it is 2. (iv) If a mod p > 1, it is the order of a modulo $p^e$.

21. [M25] In a linear congruential sequence of maximum period, let $X_0 = 0$ and let s be the least positive integer such that $a^s \equiv 1$ (modulo m). Prove that $\gcd(X_s, m) = s$.

22. [M25] Discuss the problem of finding moduli $m = b^k \pm b^l \pm 1$ so that the subtract-with-borrow and add-with-carry generators of exercise 3.2.1.1-14 will have very long periods.


3.2.1.3. Potency. In the preceding section, we showed that the maximum period can be obtained when b = a - 1 is a multiple of each prime dividing m; and b must also be a multiple of 4 if m is a multiple of 4. If z is the radix of the machine being used (so that z = 2 for a binary computer, and z = 10 for a decimal computer) and if m is the word size $z^e$, the multiplier

$$a = z^k + 1, \qquad 2 \le k < e \tag{1}$$

satisfies these conditions. Theorem 3.2.1.2A also says that we may take c = 1. The recurrence relation now has the form

$$X_{n+1} = \bigl((z^k + 1)X_n + 1\bigr) \bmod z^e, \tag{2}$$

and this equation suggests that we can avoid the multiplication; merely shifting and adding will suffice.

For example, suppose we choose $a = B^2 + 1$, where B is the byte size of MIX. The code

    LDA X;  SLA 2;  ADD X;  INCA 1        (3)

can be used in place of the instructions given in Section 3.2.1.1, and the execution time decreases from 16u to 7u.
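On a binary machine (z = 2) the same idea can be written with a bit shift. The sketch below is only an illustration of recurrence (2), not a recommended generator; the function name and the sample values of k and e are assumptions.

```python
def next_shift_add(x, k, e):
    """One step of X <- ((2^k + 1) X + 1) mod 2^e using a shift and two additions."""
    mask = (1 << e) - 1            # reduction modulo 2^e keeps the low e bits
    return ((x << k) + x + 1) & mask

# The shift-and-add step agrees with the multiplicative form of (2).
k, e = 9, 35
x = 12345
for _ in range(5):
    assert next_shift_add(x, k, e) == ((2 ** k + 1) * x + 1) % 2 ** e
    x = next_shift_add(x, k, e)
```

The remainder of this section explains why the convenience of such multipliers is outweighed by their poor randomness.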

For this reason, multipliers having form (1) have been widely discussed in the 
literature, and indeed they have been recommended by many authors. However, 
the early years of experimentation with this method showed conclusively that 
multipliers having the simple form in (1) should be avoided. The generated 
numbers just aren’t random enough. 

Later in this chapter we shall be discussing some rather sophisticated theory 
that accounts for the badness of all the linear congruential random number gen- 
erators known to be bad. However, some generators (such as (2)) are sufficiently 
awful that a comparatively simple theory can be used to rule them out. This 
simple theory is related to the concept of “potency,” which we shall now discuss. 

The potency of a linear congruential sequence with maximum period is defined to be the least integer s such that

$$b^s \equiv 0 \quad (\text{modulo } m). \tag{4}$$

(Such an integer s will always exist when the multiplier satisfies the conditions of Theorem 3.2.1.2A, since b is a multiple of every prime dividing m.)

We may analyze the randomness of the sequence by taking $X_0 = 0$, since 0 occurs somewhere in the period. With this assumption, Eq. 3.2.1-(6) reduces to

$$X_n = \bigl((a^n - 1)c/b\bigr) \bmod m;$$

and if we expand $a^n - 1 = (b + 1)^n - 1$ by the binomial theorem, we find that

$$X_n = c\left(n + \binom{n}{2} b + \cdots + \binom{n}{s} b^{s-1}\right) \bmod m. \tag{5}$$

All terms in $b^s$, $b^{s+1}$, etc., may be ignored, since they are multiples of m.

Equation (5) can be instructive, so we shall consider some special cases. If a = 1, the potency is 1; and $X_n \equiv cn$ (modulo m), as we have already observed, so the sequence is surely not random. If the potency is 2, we have $X_n \equiv cn + cb\binom{n}{2}$, and again the sequence is not very random; indeed,

$$X_{n+1} - X_n \equiv c + cbn$$

in this case, so the differences between consecutively generated numbers change in a simple way from one value of n to the next. The point $(X_n, X_{n+1}, X_{n+2})$ always lies on one of the four planes

$$\begin{aligned} x - 2y + z &= d + m, & x - 2y + z &= d - m,\\ x - 2y + z &= d, & x - 2y + z &= d - 2m, \end{aligned}$$

in three-dimensional space, where d = cb mod m.
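The four-plane structure is easy to observe experimentally. In the sketch below the parameters m = 2^10, a = 2^5 + 1, c = 1 are arbitrary illustrative choices (they give b = 32 and b^2 = m, hence potency 2), and the function name is hypothetical.

```python
def plane_offsets(m, a, c):
    """Return the set of values (x - 2y + z) - d over all consecutive triples
    (x, y, z) of one full period, where d = c*b mod m and b = a - 1."""
    d = c * (a - 1) % m
    xs = [0]
    for _ in range(m + 1):
        xs.append((a * xs[-1] + c) % m)
    return {xs[n] - 2 * xs[n + 1] + xs[n + 2] - d for n in range(m)}

m = 2 ** 10
offsets = plane_offsets(m, 2 ** 5 + 1, 1)
print(sorted(offsets))
```

Every offset printed should be one of -2m, -m, 0, or m, so all m triples fall on at most four parallel planes.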

If the potency is 3, the sequence begins to look somewhat more random, 
but there is a high degree of dependency between Xn, Xn+1, and Xn+2; tests 
show that sequences with potency 3 are still not sufficiently good. Reasonable 
results have been reported when the potency is 4 or more, but they have been 
disputed by other people. A potency of at least 5 would seem to be required for 
sufficiently random values. 

Suppose, for example, that $m = 2^{35}$ and $a = 2^k + 1$. Then $b = 2^k$, so we find that the value $b^2 = 2^{2k}$ is a multiple of m when $k \ge 18$: The potency is 2. If k = 17, 16, ..., 12, the potency is 3, and a potency of 4 is achieved for k = 11, 10, 9. The only acceptable multipliers, from the standpoint of potency, therefore have $k \le 8$. This means $a \le 257$, and we shall see later that small multipliers are also to be avoided. We have now eliminated all multipliers of the form $2^k + 1$ when $m = 2^{35}$.
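The potencies quoted in this example follow directly from definition (4). A minimal sketch, with a hypothetical function name, that computes the potency by repeated multiplication:

```python
def potency(a, m):
    """Least s >= 1 with (a - 1)^s = 0 (modulo m), or None if no such s exists."""
    b = (a - 1) % m
    power, s = b, 1
    while power != 0:
        # If some power of b vanishes modulo m, the first one to do so has
        # exponent at most log2(m), so we can stop searching beyond that.
        if s > m.bit_length():
            return None
        power = power * b % m
        s += 1
    return s

m = 2 ** 35
for k in (20, 18, 17, 12, 11, 9, 8):
    print(k, potency(2 ** k + 1, m))
```

The printed values should reproduce the ranges given above: potency 2 for k ≥ 18, potency 3 for 12 ≤ k ≤ 17, potency 4 for 9 ≤ k ≤ 11, and potency 5 for k = 8.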

When m is equal to $w \pm 1$, where w is the word size, m is generally not divisible by high powers of primes, and a high potency is impossible (see exercise 6). So in this case, the maximum-period method should not be used; the pure-multiplication method with c = 0 should be applied instead.

It must be emphasized that high potency is necessary but not sufficient 
for randomness; we use the concept of potency only to reject impotent genera- 
tors, not to accept the potent ones. Linear congruential sequences should pass 
the “spectral test” discussed in Section 3.3.4 before they are considered to be 
acceptably random. 


EXERCISES 


1. [M10] Show that, no matter what the byte size B of MIX happens to be, the code 
(3) yields a random number generator of maximum period. 


2. [10] What is the potency of the generator represented by the MIX code (3)? 


3. [11] When $m = 2^{35}$, what is the potency of the linear congruential sequence with a = 3141592621? What is the potency if the multiplier is $a = 2^{23} + 2^{13} + 2^2 + 1$?

4. [15] Show that if $m = 2^e \ge 8$, maximum potency is achieved when a mod 8 = 5.

5. [M20] Given that $m = p_1^{e_1} \cdots p_t^{e_t}$ and $a = 1 + k p_1^{f_1} \cdots p_t^{f_t}$, where a satisfies the conditions of Theorem 3.2.1.2A and k is relatively prime to m, show that the potency is $\max(\lceil e_1/f_1 \rceil, \ldots, \lceil e_t/f_t \rceil)$.

▶ 6. [20] Which of the values of $m = w \pm 1$ in Table 3.2.1.1-1 can be used in a linear congruential sequence of maximum period whose potency is 4 or more? (Use the result of exercise 5.)

7. [M20] When a satisfies the conditions of Theorem 3.2.1.2A, it is relatively prime to m; hence there is a number a' such that $aa' \equiv 1$ (modulo m). Show that a' can be expressed simply in terms of b.

8. [M26] A random number generator defined by $X_{n+1} = (2^{17} + 3)X_n \bmod 2^{35}$ and $X_0 = 1$ was subjected to the following test: Let $Y_n = \lfloor 20X_n/2^{35} \rfloor$; then $Y_n$ should be a random integer between 0 and 19, and the triples $(Y_{3n}, Y_{3n+1}, Y_{3n+2})$ should take on each of the 8000 possible values from (0, 0, 0) to (19, 19, 19) with nearly equal frequency. But with 1,000,000 values of n tested, many triples never occurred, and others occurred much more often than they should have. Can you account for this failure?


3.2.2. Other Methods 


Of course, linear congruential sequences are not the only sources of random num- 
bers that have been proposed for computer use. In this section we shall review 
the most significant alternatives. Some of these methods are quite important, 
while others are interesting chiefly because they are not as good as a person 
might expect. 

One of the common fallacies encountered in connection with random number 
generation is the idea that we can take a good generator and modify it a little, in 
order to get an “even more random” sequence. This is often false. For example, 
we know that 

$$X_{n+1} = (aX_n + c) \bmod m \tag{1}$$

leads to reasonably good random numbers; wouldn't the sequence produced by

$$X_{n+1} = \bigl((aX_n) \bmod (m + 1) + c\bigr) \bmod m \tag{2}$$

be even more random? The answer is, the new sequence is probably a great deal less random. For the whole theory breaks down, and in the absence of any theory about the behavior of the sequence (2), we come into the area of generators of the type $X_{n+1} = f(X_n)$ with the function f chosen at random; exercises 3.1-11 through 3.1-15 show that these sequences probably behave much more poorly than the sequences obtained from the more disciplined function (1).

Let us consider another approach, in an attempt to obtain a genuine im- 
provement of sequence (1). The linear congruential method can be generalized 
to, say, a quadratic congruential method: 


$$X_{n+1} = (dX_n^2 + aX_n + c) \bmod m. \tag{3}$$


Exercise 8 generalizes Theorem 3.2.1.2A to obtain necessary and sufficient con- 
ditions on a, c, and d such that the sequence defined by (3) has a period of the 
maximum length m; the restrictions are not much more severe than in the linear 
method. 

An interesting quadratic method has been proposed by R. R. Coveyou when m is a power of two: Let

$$X_0 \bmod 4 = 2, \qquad X_{n+1} = X_n(X_n + 1) \bmod 2^e, \qquad n \ge 0. \tag{4}$$

This sequence can be computed with about the same efficiency as (1), without any worries of overflow. It has an interesting connection with von Neumann's original middle-square method: If we let $Y_n$ be $2^e X_n$, so that $Y_n$ is a double-precision number obtained by placing e zeros to the right of the binary representation of $X_n$, then $Y_{n+1}$ consists of precisely the middle 2e digits of $Y_n^2 + 2^e Y_n$! In other words, Coveyou's method is almost identical to a somewhat degenerate double-precision middle-square method, yet it is guaranteed to have a long period; further evidence of its randomness is proved in Coveyou's paper cited in the answer to exercise 8.
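The middle-square connection can be verified bit by bit. In this sketch the function names and the sample word length e = 16 are assumptions made for illustration; "middle 2e digits" is interpreted as bits e through 3e - 1 of a 4e-bit binary number.

```python
def coveyou_step(x, e):
    """One step of (4): X <- X (X + 1) mod 2^e."""
    return x * (x + 1) % (1 << e)

def middle_bits(value, e):
    """The middle 2e bits of a number regarded as 4e bits long."""
    return (value >> e) & ((1 << (2 * e)) - 1)

def matches_middle_square(x0, e, steps):
    """Check that Y_{n+1} = middle 2e bits of Y_n^2 + 2^e Y_n, where Y_n = 2^e X_n."""
    x = x0
    for _ in range(steps):
        y = x << e
        nxt = coveyou_step(x, e)
        if middle_bits(y * y + (y << e), e) != nxt << e:
            return False
        x = nxt
    return True

print(matches_middle_square(2, 16, 1000))   # X_0 = 2 satisfies X_0 mod 4 = 2
```

The identity holds because $Y_n^2 + 2^e Y_n = 2^{2e} X_n(X_n + 1)$, so discarding the low e bits and keeping the next 2e bits leaves $2^e$ times $X_n(X_n+1) \bmod 2^e$.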


Other generalizations of Eq. (1) also suggest themselves; for example, we might try to extend the period length of the sequence. The period of a linear congruential sequence is fairly long; when m is approximately the word size of the computer, we usually get periods on the order of $10^9$ or more, and typical calculations will use only a very small portion of the sequence. On the other hand, when we discuss the idea of "accuracy" in Section 3.3.4 we will see that the period length influences the degree of randomness achievable in a sequence. Therefore it can be desirable to seek a longer period, and several methods are available for this purpose. One technique is to make $X_{n+1}$ depend on both $X_n$ and $X_{n-1}$, instead of just on $X_n$; then the period length can be as high as $m^2$, since the sequence will not begin to repeat until we have $(X_{n+\lambda}, X_{n+\lambda+1}) = (X_n, X_{n+1})$. John Mauchly, in an unpublished paper presented to a statistics conference in 1949, extended the middle square method by using the recurrence $X_n = \text{middle}(X_{n-1} \cdot X_{n-6})$.

The simplest sequence in which $X_{n+1}$ depends on more than one of the preceding values is the Fibonacci sequence,

$$X_{n+1} = (X_n + X_{n-1}) \bmod m. \tag{5}$$

This generator was considered in the early 1950s, and it usually gives a period length greater than m. But tests have shown that the numbers produced by the Fibonacci recurrence are definitely not satisfactorily random, and so our main interest in (5) as a source of random numbers is that it makes a nice "bad example." We may also consider generators of the form

$$X_{n+1} = (X_n + X_{n-k}) \bmod m, \tag{6}$$

when k is a comparatively large value. This recurrence was introduced by Green, Smith, and Klem [JACM 6 (1959), 527-537], who reported that, when $k \le 15$, the sequence fails to pass the "gap test" described in Section 3.3.2, although when k = 16 the test was satisfactory.

A much better type of additive generator was devised in 1958 by G. J. Mitchell and D. P. Moore [unpublished], who suggested the somewhat unusual sequence defined by

$$X_n = (X_{n-24} + X_{n-55}) \bmod m, \qquad n \ge 55, \tag{7}$$

where m is even, and where $X_0, \ldots, X_{54}$ are arbitrary integers not all even. The constants 24 and 55 in this definition were not chosen at random; they are special values that happen to define a sequence whose least significant bits, $\langle X_n \bmod 2 \rangle$, will have a period of length $2^{55} - 1$. Therefore the sequence $\langle X_n \rangle$ must have a period at least this long. Exercise 30 proves that (7) has a period of length exactly $2^{e-1}(2^{55} - 1)$ when $m = 2^e$.

At first glance Eq. (7) may not seem to be extremely well suited to machine 
implementation, but in fact there is a very efficient way to generate the sequence 
using a cyclic list: 


Algorithm A (Additive number generator). Memory cells Y[1], Y[2], ..., Y[55] are initially set to the values $X_{54}$, $X_{53}$, ..., $X_0$, respectively; j is initially equal to 24 and k is 55. Successive performances of this algorithm will produce the numbers $X_{55}$, $X_{56}$, ... as output.

A1. [Add.] (If we are about to output $X_n$ at this point, Y[j] now equals $X_{n-24}$ and Y[k] equals $X_{n-55}$.) Set Y[k] ← (Y[k] + Y[j]) mod $2^e$, and output Y[k].

A2. [Advance.] Decrease j and k by 1. If now j = 0, set j ← 55; otherwise if k = 0, set k ← 55. (We cannot have both j = 0 and k = 0.) ∎


This algorithm in MIX is simply the following: 


Program A (Additive number generator). Assuming that index registers 5 and 6, representing j and k, are not touched by the remainder of the program in which this routine is embedded, the following code performs Algorithm A and leaves the result in register A.

    LDA  Y,6      A1. Add.
    ADD  Y,5      Y_k + Y_j (overflow possible)
    STA  Y,6      → Y_k.
    DEC5 1        A2. Advance. j ← j - 1.
    DEC6 1        k ← k - 1.
    J5P  *+2
    ENT5 55       If j = 0, set j ← 55.
    J6P  *+2
    ENT6 55       If k = 0, set k ← 55. ∎
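For readers who prefer a higher-level notation, Algorithm A can also be sketched in Python. This is an illustrative transcription, not the text's program: the class name, the method name `next_value`, and the convention of passing the seed as a list `[X_0, ..., X_54]` are all assumptions.

```python
class AdditiveGenerator:
    """Algorithm A: X_n = (X_{n-24} + X_{n-55}) mod 2^e, kept in a cyclic list."""

    def __init__(self, seed_values, e):
        # seed_values = [X_0, ..., X_54]; they should not all be even.
        if len(seed_values) != 55:
            raise ValueError("exactly 55 seed values are required")
        self.mask = (1 << e) - 1
        # Y[1..55] hold X_54, X_53, ..., X_0; index 0 is unused padding.
        self.y = [0] + [v & self.mask for v in reversed(seed_values)]
        self.j, self.k = 24, 55

    def next_value(self):
        # A1. [Add.]  Y[k] <- (Y[k] + Y[j]) mod 2^e, and output Y[k].
        self.y[self.k] = (self.y[self.k] + self.y[self.j]) & self.mask
        out = self.y[self.k]
        # A2. [Advance.]  Step both pointers down, wrapping from 0 to 55.
        self.j -= 1
        self.k -= 1
        if self.j == 0:
            self.j = 55
        elif self.k == 0:
            self.k = 55
        return out

# Usage: the seeds here are arbitrary illustrative values (X_0 = 1 is odd).
seeds = [(i * i + 1) % 1024 for i in range(55)]
gen = AdditiveGenerator(seeds, 10)
first_outputs = [gen.next_value() for _ in range(5)]   # X_55, ..., X_59
print(first_outputs)
```

The cyclic list avoids moving any data: each new value overwrites the oldest one, which is exactly the term $X_{n-55}$ that has just been used.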


This generator is usually faster than the other methods we have been dis- 
cussing, since it does not require any multiplication. Besides its speed, it has 
the longest period we have seen yet, except in exercise 3.2.1.2-22. Furthermore,
as Richard Brent has observed, it can be made to work correctly with floating 
point numbers, avoiding the need to convert between integers and fractions (see 
exercise 23). Therefore it may well prove to be the very best source of random 
numbers for practical purposes. The main reason why it is difficult to recommend 
sequences like (7) wholeheartedly is that there is still very little theory to prove 
that they do or do not have desirable randomness properties; essentially all we 
know for sure is that the period is very long, and this is not enough. John Reiser 
(Ph.D. thesis, Stanford University, 1977) has shown, however, that an additive 
sequence like (7) will be well distributed in high dimensions, provided that a 
certain plausible conjecture is true (see exercise 26). 

The numbers 24 and 55 in (7) are commonly called lags, and the numbers $X_n$ defined by (7) are said to form a lagged Fibonacci sequence. Lags like (24, 55) work well because of theoretical results developed in some of the exercises below. It is of course better to use somewhat larger lags when an application happens to use, say, groups of 55 values at a time; the numbers generated by (7) will never have $X_n$ lying strictly between $X_{n-24}$ and $X_{n-55}$ (see exercise 2). J.-M. Normand, H. J. Herrmann, and M. Hajjar detected slight biases in the numbers generated by (7) when they did extensive high-precision Monte Carlo studies requiring $10^{11}$ random numbers [J. Statistical Physics 52 (1988), 441-446]; but larger values of k decreased the bad effects. Table 1 lists several useful pairs (l, k) for which the sequence $X_n = (X_{n-l} + X_{n-k}) \bmod 2^e$ has period length $2^{e-1}(2^k - 1)$. The case (l, k) = (30, 127) should be large enough for most applications, especially in combination with other randomness-enhancing techniques that we will discuss later.

Table 1
LAGS THAT YIELD LONG PERIODS MOD 2

    (24, 55)    (37, 100)    (83, 258)     (273, 607)      (576, 3217)     (7083, 19937)
    (38, 89)    (30, 127)    (107, 378)    (1029, 2281)    (4187, 9689)    (9739, 23209)

For extensions of this table, see N. Zierler and J. Brillhart, Information and Control 13 (1968), 541-554, 14 (1969), 566-569, 15 (1969), 67-69; Y. Kurita and M. Matsumoto, Math. Comp. 56 (1991), 817-821; Heringa, Blöte, and Compagner, Int. J. Mod. Phys. C3 (1992), 561-564.

George Marsaglia [Comp. Sci. and Statistics: Symposium on the Interface 16 (1984), 3-10] has suggested replacing (7) by

$$X_n = (X_{n-24} \cdot X_{n-55}) \bmod m, \qquad n \ge 55, \tag{7$'$}$$

where m is a multiple of 4 and where $X_0$ through $X_{54}$ are odd, not all congruent to 1 (modulo 4). Then the second-least significant bits have a period of $2^{55} - 1$, while the most significant bits are more thoroughly mixed than before since they depend on all bits of $X_{n-24}$ and $X_{n-55}$ in an essential way. Exercise 31 shows that the period length of sequence (7') is only slightly less than that of (7).

Lagged Fibonacci generators have been used successfully in many situations 
since 1958, so it came as a shock to discover in the 1990s that they actually fail 
an extremely simple, non-contrived test for randomness (see exercise 3.3.2-31). 
A workaround that avoids such problems by discarding appropriate elements of 
the sequence is described near the end of this section. 


Instead of considering purely additive or purely multiplicative sequences, we can construct useful random number generators by taking general linear combinations of $X_{n-1}, \ldots, X_{n-k}$ for small k. In this case the best results occur when the modulus m is a large prime; for example, m can be chosen to be the largest prime number that fits in a single computer word (see Table 4.5.4-2). When m = p is prime, the theory of finite fields tells us that it is possible to find multipliers $a_1, \ldots, a_k$ such that the sequence defined by

$$X_n = (a_1 X_{n-1} + \cdots + a_k X_{n-k}) \bmod p \tag{8}$$

has period length $p^k - 1$; here $X_0, \ldots, X_{k-1}$ may be chosen arbitrarily but not all zero. (The special case k = 1 corresponds to a multiplicative congruential sequence with prime modulus, with which we are already familiar.) The constants $a_1, \ldots, a_k$ in (8) have the desired property if and only if the polynomial

$$f(x) = x^k - a_1 x^{k-1} - \cdots - a_k \tag{9}$$

is a "primitive polynomial modulo p," that is, if and only if this polynomial has a root that is a primitive element of the field with $p^k$ elements (see exercise 4.6.2-16).

Of course, the mere fact that suitable constants $a_1, \ldots, a_k$ exist giving a 
period of length $p^k - 1$ is not enough for practical purposes; we must be able to 
find them, and we can’t simply try all $p^k$ possibilities, since p is on the order 
of the computer’s word size. Fortunately there are exactly $\varphi(p^k - 1)/k$ suitable 
choices of $(a_1, \ldots, a_k)$, so there is a fairly good chance of hitting one after making 
a few random tries. But we also need a way to tell quickly whether or not (9) 
is a primitive polynomial modulo p; it is certainly unthinkable to generate up 
to $p^k - 1$ elements of the sequence and wait for a repetition! Methods of testing 
for primitivity modulo p are discussed by Alanen and Knuth in Sankhyā A26 
(1964), 305-328. The following criteria can be used: Let $r = (p^k - 1)/(p - 1)$. 


i) $(-1)^{k-1} a_k$ must be a primitive root modulo p. (See Section 3.2.1.2.) 
ii) The polynomial $x^r$ must be congruent to $(-1)^{k-1} a_k$, modulo f(x) and p. 
iii) The degree of $x^{r/q} \bmod f(x)$, using polynomial arithmetic modulo p, must 
be positive, for each prime divisor q of r. 

Efficient ways to compute the polynomial $x^n \bmod f(x)$, using polynomial 
arithmetic modulo a given prime p, are discussed in Section 4.6.2. 
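Criteria (i), (ii), and (iii) translate almost directly into a program. The following Python sketch is only an illustration under simplifying assumptions, not the method of Section 4.6.2: the names `prime_factors`, `poly_mulmod`, `poly_powmod`, `is_primitive`, and `brute_force_period` are hypothetical, factoring is done by naive trial division (so it is suitable only for small p and k), and a polynomial of the form (9) is represented by the tuple $(a_1, \ldots, a_k)$.

```python
def prime_factors(n):
    """Set of prime divisors of n, by trial division (small n only)."""
    factors, d = set(), 2
    while d * d <= n:
        while n % d == 0:
            factors.add(d)
            n //= d
        d += 1
    if n > 1:
        factors.add(n)
    return factors


def poly_mulmod(u, v, a, p):
    """u(x)*v(x) modulo f(x) and p, where x^k = a[0]x^(k-1) + ... + a[k-1]."""
    k = len(a)
    prod = [0] * (2 * k - 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            prod[i + j] = (prod[i + j] + ui * vj) % p
    for i in range(2 * k - 2, k - 1, -1):       # eliminate x^i for i >= k
        c, prod[i] = prod[i], 0
        for j in range(1, k + 1):
            prod[i - j] = (prod[i - j] + c * a[j - 1]) % p
    return prod[:k]


def poly_powmod(n, a, p):
    """x^n mod f(x), as k coefficients with the constant term first."""
    k = len(a)
    result = [1] + [0] * (k - 1)
    base = [0, 1] + [0] * (k - 2) if k > 1 else [a[0] % p]
    while n > 0:                                # binary exponentiation
        if n & 1:
            result = poly_mulmod(result, base, a, p)
        base = poly_mulmod(base, base, a, p)
        n >>= 1
    return result


def is_primitive(a, p):
    """Apply criteria (i)-(iii) to f(x) = x^k - a[0]x^(k-1) - ... - a[k-1]."""
    k = len(a)
    r = (p**k - 1) // (p - 1)
    c = ((-1) ** (k - 1) * a[-1]) % p
    # (i) c must be a primitive root modulo p
    if c == 0 or any(pow(c, (p - 1) // q, p) == 1 for q in prime_factors(p - 1)):
        return False
    # (ii) x^r must be congruent to the constant c
    if poly_powmod(r, a, p) != [c] + [0] * (k - 1):
        return False
    # (iii) x^(r/q) must have positive degree for each prime q dividing r
    return all(any(poly_powmod(r // q, a, p)[1:]) for q in prime_factors(r))


def brute_force_period(a, p):
    """Steps until the state (0,...,0,1) of recurrence (8) recurs, else None."""
    k = len(a)
    start = state = (0,) * (k - 1) + (1,)
    for n in range(1, p**k + 1):
        x = sum(ai * xi for ai, xi in zip(a, reversed(state))) % p
        state = state[1:] + (x,)
        if state == start:
            return n
    return None
```

For p = 5 and k = 2 the count of accepted pairs $(a_1, a_2)$ should agree with $\varphi(24)/2 = 4$, and each accepted pair should give period 24 when checked by brute force; for realistic word-size primes the brute-force check is of course out of the question, and factoring r becomes the hard part.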

In order to carry out this test, we need to know the prime factorization of 
$r = (p^k - 1)/(p - 1)$, and this is the limiting factor in the calculation; r can 
be factored in a reasonable amount of time when k = 2, 3, and perhaps 4, but 
higher values of k are difficult to handle when p is large. Even k = 2 essentially 
doubles the number of “significant random digits” over what is achievable with 
k = 1, so larger values of k will rarely be necessary. 

An adaptation of the spectral test (Section 3.3.4) can be used to rate the 
sequence of numbers generated by (8); see exercise 3.3.4-24. The considerations 
of that section show that we should not make the obvious choice of $a_1 = +1$ or 
$-1$ when a primitive polynomial of that form exists; it is better to pick large, 
essentially “random” values of $a_1, \ldots, a_k$ that satisfy the conditions, and to verify 
the choice by applying the spectral test. A significant amount of computation 
is involved in finding $a_1, \ldots, a_k$, but all known evidence indicates that the result 
will be a very satisfactory source of random numbers. We essentially achieve the 
randomness of a linear congruential generator with k-tuple precision, using only 
single precision operations. 

The special case p = 2 is of independent interest. Sometimes a random 
number generator is desired that merely produces a random sequence of bits— 
zeros and ones — instead of fractions between zero and one. There is a simple way 
to generate a highly random bit sequence on a binary computer, manipulating 
k-bit words: Start with an arbitrary nonzero binary word X. To get the next 
random bit of the sequence, do the following operations, shown in MIX’s language 
(see exercise 16): 

    LDA  X      (Assume that overflow is now “off.”)
    ADD  X      Shift left one bit.
    JNOV *+2    Jump if the high bit was originally zero.           (10)
    XOR  A      Otherwise adjust the number with “exclusive or.”
    STA  X      ▮


The fourth instruction here is the “exclusive or” operation found on nearly all 
binary computers (see exercise 2.5-28 and Section 7.1.3); it changes each bit 
position of rA in which location A has a “1” bit. The value in location A is 
the binary constant $(a_1 \ldots a_k)_2$, where $x^k - a_1 x^{k-1} - \cdots - a_k$ is a primitive 
polynomial modulo 2 as above. After the code (10) has been executed, the next 
bit of the generated sequence may be taken as the least significant bit of word X. 
Alternatively, we could consistently use the most significant bit of X, if the most 
significant bit is more convenient. 


    1011
    0101
    1010
    0111
    1110
    1111
    1101
    1001
    0001
    0010
    0100
    1000
    0011
    0110
    1100
    1011

Fig. 1. Successive contents of the computer word X in the binary 
method, assuming that k = 4 and CONTENTS(A) = $(0011)_2$. 

For example, consider Fig. 1, which illustrates the sequence generated for 
k = 4 and CONTENTS(A) = $(0011)_2$. This is, of course, an unusually small value 
for k. The right-hand column shows the generated sequence of bits, namely 
1101011110001001 ..., repeating in a period of length $2^k - 1 = 15$. This sequence 
is quite random, considering that it was generated with only four bits of memory; 
to see this, consider the adjacent sets of four bits occurring in the period, namely 
1101, 1010, 0101, 1011, 0111, 1111, 1110, 1100, 1000, 0001, 0010, 0100, 1001, 
0011, 0110. In general, every possible adjacent set of k bits occurs exactly once 
in the period, except the set of all zeros, since the period length is $2^k - 1$; thus, 
adjacent sets of k bits are essentially independent. We shall see in Section 3.5 
that this is a very strong criterion for randomness when k is, say, 30 or more. 
Theoretical results illustrating the randomness of this sequence are given in an 
article by R. C. Tausworthe, Math. Comp. 19 (1965), 201-209. 
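
The example of Fig. 1 is easy to check by machine. The following Python sketch simulates code (10) on a k-bit word; the function name `binary_method` is hypothetical, and the MIX overflow toggle is modeled simply by inspecting the high-order bit before the shift.

```python
from itertools import islice


def binary_method(x, a, k):
    """Yield successive contents of the k-bit word X under method (10)."""
    mask = (1 << k) - 1
    while True:
        yield x
        overflow = x >> (k - 1)      # high bit, lost by the left shift
        x = (x << 1) & mask          # ADD X: shift left one bit
        if overflow:                 # JNOV skips the XOR when no overflow
            x ^= a                   # XOR A


words = list(islice(binary_method(0b1011, 0b0011, 4), 16))
bits = "".join(str(w & 1) for w in words)    # low-order bit of each word
print(bits)  # 1101011110001001
```

The sixteen words produced are those of Fig. 1, with the sixteenth equal to the first, and the low-order bits agree with the bit sequence quoted in the text.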

Primitive polynomials modulo 2 of degree ≤ 168 have been tabulated by 
W. Stahnke, Math. Comp. 27 (1973), 977-980. When k = 35, we may take 

    CONTENTS(A) = $(00000000000000000000000000000000101)_2$, 

but the considerations of exercises 18 and 3.3.4-24 imply that it would be better 
to find “random” constants that define primitive polynomials modulo 2. 

Caution: Several people have been trapped into believing that this random 
bit-generation technique can be used to generate random whole-word fractions 
$(.X_0 X_1 \ldots X_{k-1})_2$, $(.X_k X_{k+1} \ldots X_{2k-1})_2$, ...; but it is actually a poor source 
of random fractions, even though the bits are individually quite random. 
Exercise 18 explains why. 

Mitchell and Moore’s additive generator (7) is essentially based on the 
concept of primitive polynomials: The polynomial $x^{55} + x^{24} + 1$ is primitive, 
and Table 1 is essentially a listing of certain primitive trinomials modulo 2. 
A generator almost identical to that of Mitchell and Moore was independently 
discovered in 1971 by T. G. Lewis and W. H. Payne [JACM 20 (1973), 456-468], 
but using “exclusive or” instead of addition; this makes the period length exactly 
$2^{55} - 1$. Each bit position in the sequence of Lewis and Payne runs through the 
same periodic sequence, but has its own starting point. Experience has shown 
that (7) gives better results. 

We have now seen that sequences with $0 \le X_n < m$ and period $m^k - 1$ 
can be constructed without great difficulty, when $X_n$ is a suitable function of 
$X_{n-1}, \ldots, X_{n-k}$ and when m is prime. The highest conceivable period for any 
sequence defined by a relation of the form 

$$X_n = f(X_{n-1}, \ldots, X_{n-k}), \qquad 0 \le X_n < m, \tag{11}$$

is easily seen to be $m^k$. M. H. Martin [Bull. Amer. Math. Soc. 40 (1934), 859-864] 
was the first person to show that functions achieving this maximum period 
are possible for all m and k. His method is easy to state (exercise 17) and 
reasonably efficient to program (exercise 29), but it is unsuitable for random 
number generation because it changes the value of $X_{n-1} + \cdots + X_{n-k}$ very 
slowly: All k-tuples occur, but not in a very random order. A better class of 
functions f that yield the maximum period $m^k$ is considered in exercise 21. 
The corresponding programs are, in general, not as efficient for random number 
generation as other methods we have described, but they do give demonstrable 
randomness when the period as a whole is considered. 
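
Martin’s construction, as stated in exercise 17, can be sketched directly in Python. This is a naive illustration that remembers every k-tuple seen so far, so it is not the efficient computation asked for in exercise 29; the name `martin_sequence` is hypothetical.

```python
def martin_sequence(m, k):
    """Build the sequence of exercise 17: start with k zeros, then keep
    appending the largest y < m whose k-tuple has not yet occurred."""
    seq = [0] * k
    seen = {tuple(seq)}
    while True:
        prefix = tuple(seq[len(seq) - k + 1:])    # the last k-1 elements
        for y in range(m - 1, -1, -1):            # try the largest y first
            if prefix + (y,) not in seen:
                seen.add(prefix + (y,))
                seq.append(y)
                break
        else:                                     # every y already used
            return seq


print("".join(map(str, martin_sequence(3, 3))))
# 00022212202112102012001110100
```

The printed string is the m = k = 3 example quoted in exercise 17; long runs of large values at the beginning show why the order of the k-tuples is not very random.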

Many other schemes have been proposed for random number generation. 
The most interesting of these alternative methods may well be the inversive 
congruential sequences suggested by Eichenauer and Lehn [Statistische Hefte 27 
(1986), 315-326]: 

$$X_{n+1} = (a X_n^{-1} + c) \bmod p. \tag{12}$$

Here p is prime, $X_n$ ranges over the set $\{0, 1, \ldots, p-1, \infty\}$, and inverses are 
defined by $0^{-1} = \infty$, $\infty^{-1} = 0$, otherwise $X^{-1} X \equiv 1$ (modulo p). Since 
0 is always followed by $\infty$ and then by c in this sequence, we could simply 
define $0^{-1} = 0$ for purposes of implementation; but the theory is cleaner and 
easier to develop when $0^{-1} = \infty$. Efficient algorithms suitable for hardware 
implementation are available for computing $X^{-1}$ modulo p; see, for example, 
exercise 4.5.2-39. Unfortunately, however, this operation is not in the repertoire 
of most computers. Exercise 35 shows that many choices of a and c yield the 
maximum period length p + 1. Exercise 37 demonstrates the most important 
property: Inversive congruential sequences are completely free of the lattice 
structure that is characteristic of linear congruential sequences. 
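
A minimal Python sketch of (12) follows, using the implementation shortcut $0^{-1} = 0$ mentioned above, so that the value $\infty$ is skipped and a sequence of period p + 1 shows only p values per cycle. The name `inversive_congruential` is hypothetical, and the modular inverse relies on the three-argument `pow` with exponent -1, which assumes Python 3.8 or later.

```python
from itertools import islice


def inversive_congruential(x, a, c, p):
    """Yield X_0, X_1, ... of (12), with 0^{-1} taken to be 0."""
    while True:
        yield x
        inv = pow(x, -1, p) if x else 0    # modular inverse, or the shortcut
        x = (a * inv + c) % p


print(list(islice(inversive_congruential(0, 2, 2, 5), 6)))  # [0, 2, 3, 1, 4, 0]
```

With the toy parameters p = 5, a = 2, c = 2 every residue appears once before the sequence returns to 0, which corresponds to the maximum period p + 1 = 6 when $\infty$ is counted.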

Another important class of techniques deals with the combination of random 
number generators. There will always be people who feel that the linear con- 
gruential methods, additive methods, etc., are all too simple to give sufficiently 
random sequences; and it may never be possible to prove that their skepticism 
is unjustified — indeed, they may be right—so it is pretty useless to argue the 
point. There are reasonably efficient ways to combine two sequences into a third 
one that should be haphazard enough to satisfy all but the most hardened skeptic. 

Suppose we have two sequences $X_0, X_1, \ldots$ and $Y_0, Y_1, \ldots$ of random numbers 
between 0 and m — 1, preferably generated by two unrelated methods. Then we 
can, for example, use one random sequence to permute the elements of another, 
as suggested by M. D. MacLaren and G. Marsaglia [JACM 12 (1965), 83-89; 
see also Marsaglia and Bray, CACM 11 (1968), 757-759]: 


Algorithm M (Randomizing by shuffling). Given methods for generating two 
sequences $\langle X_n \rangle$ and $\langle Y_n \rangle$, this algorithm will successively output the terms of 
a “considerably more random” sequence. We use an auxiliary table V[0], V[1], 
..., V[k - 1], where k is some number chosen for convenience, usually in the 
neighborhood of 100. Initially, the V-table is filled with the first k values of the 
X-sequence. 

M1. [Generate X, Y.] Set X and Y equal to the next members of the sequences 
    $\langle X_n \rangle$ and $\langle Y_n \rangle$, respectively. 

M2. [Extract j.] Set $j \leftarrow \lfloor kY/m \rfloor$, where m is the modulus used in the sequence 
    $\langle Y_n \rangle$; that is, j is a random value, $0 \le j < k$, determined by Y. 

M3. [Exchange.] Output V[j] and then set V[j] ← X. ▮ 


As an example, assume that Algorithm M is applied to the following two 
sequences, with k = 64: 


$$X_0 = 5772156649, \qquad X_{n+1} = (3141592653 X_n + 2718281829) \bmod 2^{35};$$
$$Y_0 = 1781072418, \qquad Y_{n+1} = (2718281829 Y_n + 3141592653) \bmod 2^{35}. \tag{13}$$


On intuitive grounds it appears safe to predict that the sequence obtained by 
applying Algorithm M to (13) will satisfy virtually anyone’s requirements for 
randomness in a computer-generated sequence, because the relationship between 
nearby terms of the output has been almost entirely obliterated. Furthermore, 
the time required to generate this sequence is only slightly more than twice as 
long as it takes to generate the sequence (Xn) alone. 
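
Steps M1 through M3 can be sketched in Python as a generator that consumes two iterators. The names `lcg` and `algorithm_m` are hypothetical helpers introduced for this illustration; the last lines apply the sketch to the two sequences of (13) with k = 64.

```python
from itertools import islice


def lcg(x, a, c, m):
    """Yield X_0, X_1, ... of the linear congruential sequence (a, c, m)."""
    while True:
        yield x
        x = (a * x + c) % m


def algorithm_m(xs, ys, k, m):
    """Shuffle the X-sequence under control of the Y-sequence (modulus m)."""
    v = [next(xs) for _ in range(k)]    # V starts with the first k X-values
    while True:
        x, y = next(xs), next(ys)       # M1. Generate X, Y.
        j = k * y // m                  # M2. j = floor(kY/m), 0 <= j < k.
        yield v[j]                      # M3. Output V[j] ...
        v[j] = x                        #     ... then set V[j] <- X.


m = 2**35
xs = lcg(5772156649, 3141592653, 2718281829, m)
ys = lcg(1781072418, 2718281829, 3141592653, m)
first_outputs = list(islice(algorithm_m(xs, ys, 64, m), 5))
```

Each output costs one step of each underlying generator plus a table lookup, in line with the remark above that the running time is only slightly more than twice that of $\langle X_n \rangle$ alone.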

Exercise 15 proves that the period length of Algorithm M’s output will be the 
least common multiple of the period lengths of $\langle X_n \rangle$ and $\langle Y_n \rangle$, in most situations 
of practical interest. In particular, if we reject the value 0 when it occurs in the 
Y-sequence, so that $\langle Y_n \rangle$ has period length $2^{35} - 1$, the numbers generated by 
Algorithm M from (13) will have a period of length $2^{70} - 2^{35}$. [See J. Arthur 
Greenwood, Computer Science and Statistics: Symposium on the Interface 9 
(1976), 222-227.] 

However, there is an even better way to shuffle the elements of a sequence, 
discovered by Carter Bays and S. D. Durham [ACM Trans. Math. Software 2 
(1976), 59-64]. Their approach, although it appears to be superficially similar to 
Algorithm M, can give surprisingly better performance even though it requires 
only one input sequence $\langle X_n \rangle$ instead of two: 


Algorithm B (Randomizing by shuffling). Given a method for generating a 
sequence $\langle X_n \rangle$, this algorithm will successively output the terms of a “considerably 
more random” sequence, using an auxiliary table V[0], V[1], ..., V[k - 1] 
as in Algorithm M. Initially the V-table is filled with the first k values of the 
X-sequence, and an auxiliary variable Y is set equal to the (k + 1)st value. 

B1. [Extract j.] Set $j \leftarrow \lfloor kY/m \rfloor$, where m is the modulus used in the sequence 
    $\langle X_n \rangle$; that is, j is a random value, $0 \le j < k$, determined by Y. 

B2. [Exchange.] Set Y ← V[j], output Y, and then set V[j] to the next member 
    of the sequence $\langle X_n \rangle$. ▮ 


The reader is urged to work exercises 3 and 5, in order to get a feeling for 
the difference between Algorithms M and B. 
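
For comparison with Algorithm M, here is a Python sketch of steps B1 and B2, which needs only one input iterator. The names `lcg` and `algorithm_b` are hypothetical, and the tiny modulus in the demonstration is chosen only so that the trace can be followed by hand.

```python
from itertools import islice


def lcg(x, a, c, m):
    """Yield X_0, X_1, ... of the linear congruential sequence (a, c, m)."""
    while True:
        yield x
        x = (a * x + c) % m


def algorithm_b(xs, k, m):
    """Shuffle the X-sequence (modulus m) using its own previous output."""
    v = [next(xs) for _ in range(k)]    # first k values fill the table
    y = next(xs)                        # the (k+1)st value initializes Y
    while True:
        j = k * y // m                  # B1. j = floor(kY/m), 0 <= j < k.
        y = v[j]                        # B2. Y <- V[j] ...
        v[j] = next(xs)                 #     ... refill V[j] from the X-sequence
        yield y                         #     ... and output Y.


print(list(islice(algorithm_b(lcg(0, 5, 1, 16), 4, 16), 6)))
# [15, 13, 2, 0, 8, 6]
```

The table slot is refilled before the value is yielded, which is only a reordering inside step B2; the sequence of outputs is the same as in the algorithm as stated.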

On MIX we may implement Algorithm B by taking k equal to the byte size, 
obtaining the following simple generation scheme once the initialization has been 
done: 


    LD6  Y(1:1)    j ← high-order byte of Y.
    LDA  X         rA ← X_n.
    INCA 1         (see exercise 3.2.1.1-1)
    MUL  A         rX ← X_{n+1}.
    STX  X         “n ← n + 1.”                                (14)
    LDA  V,6
    STA  Y         Y ← V[j].
    STX  V,6       V[j] ← X_n. ▮


The output appears in register A. Notice that Algorithm B requires only 
four instructions of overhead per generated number. 

F. Gebhardt [Math. Comp. 21 (1967), 708-709] found that satisfactory 
random sequences were produced by Algorithm M even when it was applied 
to a sequence as nonrandom as the Fibonacci sequence, with $X_n = F_{2n} \bmod m$ 
and $Y_n = F_{2n+1} \bmod m$. However, it is also possible for Algorithm M to produce 
a sequence less random than the original sequences, if $\langle X_n \rangle$ and $\langle Y_n \rangle$ are strongly 
related, as shown in exercise 3. Such problems do not seem to arise with 
Algorithm B. Since Algorithm B won’t make a sequence any less random, 
and since it enhances the randomness with very little extra cost, it can be 
recommended for use in combination with any other random number generator. 

Shuffling methods have an inherent defect, however: They change only 
the order of the generated numbers, not the numbers themselves. For most 
purposes the order is the critical thing, but if a random number generator fails 
the “birthday spacings” test discussed in Section 3.3.2 or the random walk test of 
exercise 3.3.2-31 it will not fare much better after it has been shuffled. Shuffling 
also has the comparative disadvantage that it does not allow us to start at a 
given place in the period, or to skip quickly from $X_n$ to $X_{n+k}$ for large k. 

Many people have therefore suggested combining two sequences (Xn) and 
(Yn) in a much simpler way, which avoids both of the defects of shuffling: We 
can use a combination like 


Zn = (Xn — Yn) mod m (15) 


when $0 \le X_n < m$ and $0 \le Y_n < m' \le m$. Exercises 13 and 14 discuss the period 
length of such sequences; exercise 3.3.2-23 shows that (15) tends to enhance the 
randomness when the seeds Xo and Yo are chosen independently. 
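
A small Python sketch of (15) follows. The names `lcg`, `combine`, and `period` are hypothetical, and the two component sequences are toy examples with relatively prime period lengths 16 and 3, chosen so that the period of the combination can be found by inspection of a short prefix.

```python
from itertools import islice


def lcg(x, a, c, m):
    """Yield X_0, X_1, ... of the linear congruential sequence (a, c, m)."""
    while True:
        yield x
        x = (a * x + c) % m


def combine(xs, ys, m):
    """Yield Z_n = (X_n - Y_n) mod m, as in (15)."""
    for x, y in zip(xs, ys):
        yield (x - y) % m


def period(seq):
    """Smallest d such that seq[i] == seq[i+d] wherever both are present."""
    n = len(seq)
    return next(d for d in range(1, n) if seq[d:] == seq[:n - d])


z = list(islice(combine(lcg(0, 5, 1, 16), lcg(0, 1, 1, 3), 16), 96))
print(period(z))  # 48
```

The observed period 48 = 16 · 3 is what exercise 13 leads us to expect when the component periods are relatively prime. Unlike shuffling, this combination lets us jump ahead in $\langle Z_n \rangle$ whenever we can jump ahead in both components.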

An even simpler way to remove the structural biases of arithmetically gen- 
erated numbers was proposed already in the early days of computing by J. Todd 
and O. Taussky Todd [Symp. on Monte Carlo Methods (Wiley, 1956), 15-28]: 
We can just throw away some numbers of the sequence. Their suggestion was of 
little use with linear congruential generators, but it has become quite appropriate 
nowadays in connection with generators like (7) that have extremely long periods, 
because we have plenty of numbers to discard. 

The simplest way to improve the randomness of (7) is to use only every jth 
term, for some small j. But a better scheme, which may be even simpler, is to use 
(7) to produce, say, 500 random numbers in an array and to use only the first 55 of 
them. After those 55 have been consumed, we generate 500 more in the same way. 
This idea was proposed by Martin Lüscher [Computer Physics Communications 
79 (1994), 100-110], motivated by the theory of chaos in dynamical systems: We 
can regard (7) as a process that maps 55 values $(X_{n-55}, \ldots, X_{n-1})$ into another 
vector of 55 values $(X_{n+t-55}, \ldots, X_{n+t-1})$. Suppose we generate $t \ge 55$ values 
and use the first 55 of them. Then if t = 55 the new vector of values is rather close 
to the old; but if $t \approx 500$ there is almost no correlation between old and new (see 
exercise 33). For the analogous case of add-with-carry or subtract-with-borrow 
generators (exercise 3.2.1.1-14), the vectors are in fact known to be the radix-b 
representation of numbers in a linear congruential generator, and the relevant 
multiplier when we generate t numbers at a time is $b^{-t}$. Lüscher’s theory for this 
case can therefore be confirmed with the spectral test of Section 3.3.4. A portable 
random number generator, based on a lagged Fibonacci sequence enhanced with 
Lüscher’s approach, appears in Section 3.6, together with further commentary. 
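
The discarding scheme can be sketched in Python as a wrapper around any generator. The sketch assumes the additive recurrence $X_n = (X_{n-24} + X_{n-55}) \bmod m$ for (7), as stated in exercise 32; the names `lagged_fibonacci` and `discard` are hypothetical, and this is not the portable generator of Section 3.6.

```python
from collections import deque
from itertools import islice


def lagged_fibonacci(seed, m):
    """Yield X_55, X_56, ... where X_n = (X_{n-24} + X_{n-55}) mod m and
    seed supplies X_0, ..., X_54."""
    state = deque(seed, maxlen=55)         # holds X_{n-55}, ..., X_{n-1}
    while True:
        x = (state[-24] + state[0]) % m    # X_{n-24} + X_{n-55}
        state.append(x)                    # the oldest value drops out
        yield x


def discard(gen, use=55, batch=500):
    """From every batch of generated values, deliver only the first `use`."""
    while True:
        for i in range(batch):
            x = next(gen)
            if i < use:
                yield x


seed = list(range(1, 56))                  # X_0, ..., X_54, not all even
raw = list(islice(lagged_fibonacci(seed, 2**30), 1000))
kept = list(islice(discard(lagged_fibonacci(seed, 2**30)), 110))
```

Here `kept` consists of the first 55 values of each block of 500 in `raw`; the price is that roughly nine values are generated for every one delivered.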


Random number generators typically do only a few multiplications and/or 
additions to get from one element of the sequence to the next. When such 
generators are combined as suggested above, common sense tells us that the 
resulting sequences ought to be indistinguishable from truly random numbers. 
But intuitive hunches are no substitute for rigorous mathematical proof. If we are 
willing to do more work — say 1000 or 1000000 times as much— we can obtain 
sequences for which substantially better theoretical guarantees of randomness 
are available. 

For example, consider the sequence of bits B1, B2, ... generated by 


$$X_{n+1} = X_n^2 \bmod M, \qquad B_n = X_n \bmod 2, \tag{16}$$

[Blum, Blum, and Shub, SICOMP 15 (1986), 364-383], or the more elaborate 
sequence generated by 

$$X_{n+1} = X_n^2 \bmod M, \qquad B_n = X_n \cdot Z \bmod 2, \tag{17}$$

where the dot product of r-bit binary numbers $(x_{r-1} \ldots x_0)_2$ and $(z_{r-1} \ldots z_0)_2$ 
is $x_{r-1} z_{r-1} + \cdots + x_0 z_0$; here Z is an r-bit “mask,” and r is the number of bits 
in M. The modulus M should be the product of two large primes of the form 
4k + 3, and the starting value $X_0$ should be relatively prime to M. Rule (17), 
suggested by Leonid Levin, is a take-off on von Neumann’s original middle-square 
method; we will call it the muddle-square method, because it jumbles the bits of 
the squares. Rule (16) is, of course, the special case Z = 1. 
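
Rules (16) and (17) are short enough to sketch in a few lines of Python. The function name `muddle_square` is hypothetical, and the modulus 77 = 7 · 11 is a toy value used only so that the arithmetic can be checked by hand; it offers none of the guarantees that depend on M being hard to factor.

```python
from itertools import islice


def muddle_square(x, z, modulus):
    """Yield B_1, B_2, ... of rule (17); the mask z = 1 gives rule (16)."""
    while True:
        x = x * x % modulus
        yield bin(x & z).count("1") & 1    # dot product of bit vectors, mod 2


M = 7 * 11    # both primes have the form 4k + 3; far too small for real use
print(list(islice(muddle_square(3, 1, M), 8)))   # [1, 0, 0, 1, 1, 0, 0, 1]
```

With $X_0 = 3$ the squares run through 9, 4, 16, 25 and then repeat, so the toy bit sequence has period 4; a realistic M would have hundreds of digits.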

Section 3.5F contains a proof that, when $X_0$, Z, and M are chosen at 
random, the sequences generated by (16) and (17) pass all statistical tests for 
randomness that require no more work than factoring large numbers. In other 
words, the bits cannot be distinguished from truly random numbers by any 
computation lasting less than 100 years on today’s fastest computers, when M 
is suitably large, unless it is possible to find the factors of a nontrivial fraction of 
such numbers much more rapidly than is presently known. Formula (16) is 
simpler than (17), but the modulus M in (16) has to be somewhat larger than 
it does in (17) if we want to achieve the same statistical guarantees. 


EXERCISES 


1. [12] In practice, we form random numbers using $X_{n+1} = (aX_n + c) \bmod m$, where 
the X’s are integers, afterwards treating them as the fractions $U_n = X_n/m$. The 
recurrence relation for $U_n$ is actually 

$$U_{n+1} = (aU_n + c/m) \bmod 1.$$

Discuss the generation of random sequences using this relation directly, by making use 
of floating point arithmetic on the computer. 


2. [M20] A good source of random numbers will have $X_{n-1} < X_{n+1} < X_n$ about 
one-sixth of the time, since each of the six possible relative orders of $X_{n-1}$, $X_n$, and 
$X_{n+1}$ should be equally probable. However, show that the ordering above never occurs 
if the Fibonacci sequence (5) is used. 


3. [23] (a) What sequence comes from Algorithm M if 

$$X_0 = 0, \quad X_{n+1} = (5X_n + 3) \bmod 8, \quad Y_0 = 0, \quad Y_{n+1} = (5Y_n + 1) \bmod 8,$$


and k = 4? (Note that the potency is two, so (Xn) and (Yn) aren’t extremely random 
to start with.) (b) What happens if Algorithm B is applied to this same sequence (Xn) 
with k = 4? 

4. [00] Why is the most significant byte used in the first line of program (14), instead 
of some other byte? 

5. [20] Discuss using Xn = Yn in Algorithm M, in order to improve the speed of 
generation. Is the result analogous to Algorithm B? 

6. [10] In the binary method (10), the text states that the low-order bit of X is 
random, if the code is performed repeatedly. Why isn’t the entire word X random? 

7. [20] Show that a complete sequence of length $2^e$ (that is, a sequence in which 
each of the $2^e$ possible sets of e adjacent bits occurs just once in the period) may be 
obtained if program (10) is changed to the following: 

    LDA  X
    JANZ *+2
    LDA  A
    ADD  X
    JNOV *+3
    JAZ  *+2
    XOR  A
    STA  X      ▮


8. [M39] Prove that the quadratic congruential sequence (3) has period length m if 
and only if the following conditions are satisfied: 


i) c is relatively prime to m; 
ii) d and a - 1 are both multiples of p, for all odd primes p dividing m; 
iii) d is even, and $d \equiv a - 1$ (modulo 4), if m is a multiple of 4; 
     $d \equiv a - 1$ (modulo 2), if m is a multiple of 2; 
iv) $d \not\equiv 3c$ (modulo 9), if m is a multiple of 9. 

[Hint: The sequence defined by $X_0 = 0$, $X_{n+1} = dX_n^2 + aX_n + c$ modulo m has a period 
of length m only if the same sequence modulo any divisor r of m has period length r.] 


9. [M24] (R. R. Coveyou.) Use the result of exercise 8 to prove that the modified 
middle-square method (4) has a period of length $2^{e-2}$. 


10. [M29] Show that if $X_0$ and $X_1$ are not both even and if $m = 2^e$, the period of 
the Fibonacci sequence (5) is $3 \cdot 2^{e-1}$. 


11. [M36] The purpose of this exercise is to analyze certain properties of integer 
sequences satisfying the recurrence relation 


$$X_n = a_1 X_{n-1} + \cdots + a_k X_{n-k}, \qquad n \ge k.$$

If we can calculate the period length of this sequence modulo $m = p^e$, when p is prime, 
the period length with respect to an arbitrary modulus m is the least common multiple 
of the period lengths for the prime power factors of m. 

a) If f(z), a(z), b(z) are polynomials with integer coefficients, let us write $a(z) \equiv b(z)$ 
(modulo f(z) and m) if a(z) = b(z) + f(z)u(z) + mv(z) for some polynomials u(z) 
and v(z) with integer coefficients. Prove that the following statement holds when 
f(0) = 1 and $p^e > 2$: If $z^\lambda \equiv 1$ (modulo f(z) and $p^e$) and $z^\lambda \not\equiv 1$ (modulo 
f(z) and $p^{e+1}$), then $z^{p\lambda} \equiv 1$ (modulo f(z) and $p^{e+1}$) and $z^{p\lambda} \not\equiv 1$ (modulo f(z) 
and $p^{e+2}$). 

b) Let $f(z) = 1 - a_1 z - \cdots - a_k z^k$, and let 

$$G(z) = 1/f(z) = A_0 + A_1 z + A_2 z^2 + \cdots.$$

Let $\lambda(m)$ denote the period length of $\langle A_n \bmod m \rangle$. Prove that $\lambda(m)$ is the smallest 
positive integer $\lambda$ such that $z^\lambda \equiv 1$ (modulo f(z) and m). 

c) Given that p is prime, $p^e > 2$, and $\lambda(p^e) \ne \lambda(p^{e+1})$, prove that $\lambda(p^{e+r}) = p^r \lambda(p^e)$ 
for all $r \ge 0$. (Thus, to find the period length of the sequence $\langle A_n \bmod 2^e \rangle$, we 
can compute $\lambda(4)$, $\lambda(8)$, $\lambda(16)$, ... until we find the smallest $e \ge 3$ such that 
$\lambda(2^e) \ne \lambda(4)$; then the period length is determined mod $2^e$ for all e. Exercise 
4.6.3-26 explains how to calculate $X_n$ for large n in O(log n) operations.) 

d) Show that any sequence of integers satisfying the recurrence stated at the beginning 
of this exercise has the generating function g(z)/f(z), for some polynomial 
g(z) with integer coefficients. 

e) Given that the polynomials f(z) and g(z) in part (d) are relatively prime modulo p 
(see Section 4.6.1), prove that the sequence $\langle X_n \bmod p^e \rangle$ has exactly the same 
period length as the special sequence $\langle A_n \bmod p^e \rangle$ in (b). (No longer period could 
be obtained by any choice of $X_0, \ldots, X_{k-1}$, since the general sequence is a linear 
combination of “shifts” of the special sequence.) [Hint: By exercise 4.6.2-22 
(Hensel’s lemma), there exist polynomials such that $a(z)f(z) + b(z)g(z) \equiv 1$ 
(modulo $p^e$).] 

12. [M28] Find integers $X_0$, $X_1$, a, b, and c such that the sequence 

$$X_{n+1} = (aX_n + bX_{n-1} + c) \bmod 2^e, \qquad n \ge 1,$$

has the longest period length of all sequences of this type. [Hint: It follows that 
$X_{n+2} = ((a+1)X_{n+1} + (b-a)X_n - bX_{n-1}) \bmod 2^e$; see exercise 11(c).] 

13. [M20] Let $\langle X_n \rangle$ and $\langle Y_n \rangle$ be sequences of integers mod m with periods of lengths 
$\lambda_1$ and $\lambda_2$, and combine them by letting $Z_n = (X_n + Y_n) \bmod m$. Show that if $\lambda_1$ 
and $\lambda_2$ are relatively prime, the sequence $\langle Z_n \rangle$ has a period of length $\lambda_1 \lambda_2$. 


14. [M24] Let $X_n$, $Y_n$, $Z_n$, $\lambda_1$, $\lambda_2$ be as in the previous exercise. Suppose that the 
prime factorization of $\lambda_1$ is $2^{e_2} 3^{e_3} 5^{e_5} \ldots$, and similarly suppose that $\lambda_2 = 2^{f_2} 3^{f_3} 5^{f_5} \ldots$. 
Let $g_p = (\max(e_p, f_p)$ if $e_p \ne f_p$, otherwise 0), and let $\lambda_0 = 2^{g_2} 3^{g_3} 5^{g_5} \ldots$. Show that 
the period length $\lambda'$ of the sequence $\langle Z_n \rangle$ is a multiple of $\lambda_0$, and it is a divisor of 
$\lambda = \mathrm{lcm}(\lambda_1, \lambda_2)$. In particular, $\lambda' = \lambda$ if ($e_p \ne f_p$ or $e_p = f_p = 0$) for each prime p. 

15. [M27] Let the sequence $\langle X_n \rangle$ in Algorithm M have period length $\lambda_1$, and assume 
that all elements of its period are distinct. Let $q_n = \min\{r \mid r > 0$ and $\lfloor kY_{n-r}/m \rfloor = 
\lfloor kY_n/m \rfloor\}$. Assume that $q_n < \frac{1}{2}\lambda_1$ for all $n \ge n_0$, and that the sequence $\langle q_n \rangle$ has 
period length $\lambda_2$. Let $\lambda$ be the least common multiple of $\lambda_1$ and $\lambda_2$. Prove that the 
output sequence $\langle Z_n \rangle$ produced by Algorithm M has a period of length $\lambda$. 


16. [M28] Let CONTENTS(A) in method (10) be $(a_1 a_2 \ldots a_k)_2$ in binary notation. Show 
that the generated sequence of low-order bits $X_0, X_1, \ldots$ satisfies the relation 

$$X_n = (a_1 X_{n-1} + a_2 X_{n-2} + \cdots + a_k X_{n-k}) \bmod 2.$$


[This may be regarded as another way to define the sequence, although the connection 
between this relation and the efficient code (10) is not apparent at first glance!] 


17. [M33] (M. H. Martin, 1934.) Let m and k be positive integers, and let $X_1 = 
X_2 = \cdots = X_k = 0$. For all n > 0, set $X_{n+k}$ equal to the largest nonnegative value 
y < m such that the k-tuple $(X_{n+1}, \ldots, X_{n+k-1}, y)$ has not already occurred in the 
sequence; in other words, $(X_{n+1}, \ldots, X_{n+k-1}, y)$ must differ from $(X_{r+1}, \ldots, X_{r+k})$ 
for $0 \le r < n$. In this way, each possible k-tuple will occur at most once in the 
sequence. Eventually the process will terminate, when we reach a value of n such that 
$(X_{n+1}, \ldots, X_{n+k-1}, y)$ has already occurred in the sequence for all nonnegative y < m. 
For example, if m = k = 3 the sequence is 00022212202112102012001110100, and the 
process terminates at this point. (a) Prove that when the sequence terminates, we have 
$X_{n+1} = \cdots = X_{n+k-1} = 0$. (b) Prove that every k-tuple $(a_1, a_2, \ldots, a_k)$ of elements 
with $0 \le a_j < m$ occurs in the sequence; hence the sequence terminates when $n = m^k$. 
[Hint: Prove that the k-tuple $(a_1, \ldots, a_s, 0, \ldots, 0)$ appears, when $a_s \ne 0$, by induction 
on s.] Note that if we now define $f(X_n, \ldots, X_{n+k-1}) = X_{n+k}$ for $1 \le n \le m^k$, setting 
$X_{m^k+k} = 0$, we obtain a function of maximum possible period. 

18. [M22] Let $\langle X_n \rangle$ be the sequence of bits generated by method (10), with k = 35 
and CONTENTS(A) = $(00000000000000000000000000000000101)_2$. Let $U_n$ be the binary 
fraction $(.X_{nk} X_{nk+1} \ldots X_{nk+k-1})_2$; show that this sequence $\langle U_n \rangle$ fails the serial test 
on pairs (Section 3.3.2B) when d = 8. 

19. [M41] For each prime p specified in the first column of Table 2 in Section 4.5.4, 
find suitable constants $a_1$ and $a_2$ as suggested in the text, such that the period length 
of (8), when k = 2, is $p^2 - 1$. (See Eq. 3.3.4-(39) for an example.) 


20. [M40] Calculate constants suitable for use as CONTENTS (A) in method (10), having 
approximately the same number of zeros as ones, for 2 < k < 64. 


21. [M35] (D. Rees.) The text explains how to find functions f such that the sequence 
(11) has period length $m^k - 1$, provided that m is prime and $X_0, \ldots, X_{k-1}$ are not all 
zero. Show that such functions can be modified to obtain sequences of type (11) with 
period length $m^k$, for all integers m. [Hints: Consider the results of exercises 7 and 13, 
and sequences such as $\langle pX_{2n} + X_{2n+1} \rangle$.] 


22. [M24] The text restricts discussion of the extended linear sequences (8) to the 
case that m is prime. Prove that reasonably long periods can also be obtained when m 
is “squarefree,” that is, the product of distinct primes. (Examination of Table 3.2.1.1-1 
shows that m = w +1 often satisfies this hypothesis; many of the results of the text 
can therefore be carried over to that case, which is somewhat more convenient for 
calculation.) 


23. [20] Discuss the sequence defined by $X_n = (X_{n-55} - X_{n-24}) \bmod m$ as an 
alternative to (7). 

24. [M20] Let 0 < l < k. Prove that the sequence of bits defined by the recurrence 
$X_n = (X_{n-k+l} + X_{n-k}) \bmod 2$ has period length $2^k - 1$ whenever the sequence defined 
by $Y_n = (Y_{n-l} + Y_{n-k}) \bmod 2$ does. 

25. [26] Discuss the alternative to Program A that changes all 55 entries of the Y 
table every 55th time a random number is required. 


26. [M48] (J. F. Reiser.) Let p be prime and let k be a positive integer. Given integers 
$a_1, \ldots, a_k$ and $x_1, \ldots, x_k$, let $\lambda_\alpha$ be the period of the sequence $\langle X_n \rangle$ generated by the 
recurrence 

$$X_n = x_n \bmod p^\alpha, \quad 0 \le n < k; \qquad X_n = (a_1 X_{n-1} + \cdots + a_k X_{n-k}) \bmod p^\alpha, \quad n \ge k;$$

and let $N_\alpha$ be the number of 0s that occur in the period (the number of indices j such 
that $\mu_\alpha \le j < \mu_\alpha + \lambda_\alpha$ and $X_j = 0$). Prove or disprove the following conjecture: 
There exists a constant c (depending possibly on p and k and $a_1, \ldots, a_k$) such that 
$N_\alpha \le c\,p^{\alpha(k-2)/(k-1)}$ for all $\alpha$ and all $x_1, \ldots, x_k$. 

[Notes: Reiser has proved that if the recurrence has maximum period length mod p 
(that is, if $\lambda_1 = p^k - 1$), and if the conjecture holds, then the k-dimensional discrepancy 
of $\langle X_n \rangle$ will be $O(\alpha^k p^{-\alpha/(k-1)})$ as $\alpha \to \infty$; thus an additive generator like (7) would 
be well distributed in 55 dimensions, when $m = 2^e$ and the entire period is considered. 
(See Section 3.3.4 for the definition of discrepancy in k dimensions.) The conjecture 
is a very weak condition, for if $\langle X_n \rangle$ takes on each value about equally often and if 
$\lambda_\alpha = p^{\alpha-1}(p^k - 1)$, the quantity $N_\alpha \approx (p^k - 1)/p$ does not grow at all as $\alpha$ increases. 
Reiser has verified the conjecture for k = 3. On the other hand he has shown that it 
is possible to find unusually bad starting values $x_1, \ldots, x_k$ (depending on $\alpha$) so that 
$N_{2\alpha} \ge p^\alpha$, provided that $\lambda_\alpha = p^{\alpha-1}(p^k - 1)$ and $k \ge 3$ and $\alpha$ is sufficiently large.] 


27. [M30] Suppose Algorithm B is being applied to a sequence $\langle X_n \rangle$ whose period 
length is $\lambda$, where $\lambda \gg k$. Show that for fixed k and all sufficiently large $\lambda$, the output 
of the sequence will eventually be periodic with the same period length $\lambda$, unless $\langle X_n \rangle$ 
isn’t very random to start with. [Hint: Find a pattern of consecutive values of $\lfloor kX_n/m \rfloor$ 
that causes Algorithm B to “synchronize” its subsequent behavior.] 

28. [40] (A. G. Waterman.) Experiment with linear congruential sequences with m 
the square or cube of the computer word size, while a and c are single-precision numbers. 


29. [40] Find a good way to compute the function $f(x_1, \ldots, x_k)$ defined by Martin’s 
sequence in exercise 17, given only the k-tuple $(x_1, \ldots, x_k)$. 

30. [M37] (R. P. Brent.) Let $f(x) = x^k - a_1 x^{k-1} - \cdots - a_k$ be a primitive polynomial 
modulo 2, and suppose that $X_0, \ldots, X_{k-1}$ are integers not all even. 

a) Prove that the period of the recurrence $X_n = (a_1 X_{n-1} + \cdots + a_k X_{n-k}) \bmod 2^e$ 
is $2^{e-1}(2^k - 1)$ for all $e \ge 1$ if and only if $f(x)^2 + f(-x)^2 \not\equiv 2f(x^2)$ and $f(x)^2 + 
f(-x)^2 \not\equiv 2(-1)^k f(-x^2)$ (modulo 8). [Hint: We have $x^{2^k} \equiv -x$ (modulo 4 and 
f(x)) if and only if $f(x)^2 + f(-x)^2 \equiv 2f(x^2)$ (modulo 8).] 

b) Prove that this condition always holds when the polynomial $f(x) = x^k \pm x^l \pm 1$ is 
primitive modulo 2 and k > 2. 


31. [M30] (G. Marsaglia.) What is the period length of the sequence (7’) when $m = 
2^e \ge 8$? Assume that $X_0, \ldots, X_{54}$ are not all $\equiv \pm 1$ (modulo 8). 


32. [M21] What recurrences are satisfied by the elements of the subsequences $\langle X_{2n} \rangle$ 
and $\langle X_{3n} \rangle$, when $X_n = (X_{n-24} + X_{n-55}) \bmod m$? 

33. [M23] (a) Let $g_n(z) = X_{n+30} + X_{n+29} z + \cdots + X_n z^{30} + X_{n+54} z^{31} + \cdots + X_{n+31} z^{54}$, 
where the X’s satisfy the lagged Fibonacci recurrence (7). Find a simple relation 
between $g_n(z)$ and $g_{n+t}(z)$. (b) Express $X_{500}$ in terms of $X_0, \ldots, X_{54}$. 


34. [M25] Prove that the inversive congruential sequence (12) has period p + 1 if and 
only if the polynomial $f(x) = x^2 - cx - a$ has the following two properties: (i) $x^{p+1} \bmod 
f(x)$ is a nonzero constant, when computed with polynomial arithmetic modulo p; 
(ii) $x^{(p+1)/q} \bmod f(x)$ has degree 1 for every prime q that divides p + 1. [Hint: Consider 
powers of the matrix $\begin{pmatrix} 0 & 1 \\ a & c \end{pmatrix}$.] 

35. [HM35] How many pairs (a, c) satisfy the conditions of exercise 34? 

36. [M25] Prove that the inversive congruential sequence $X_{n+1} = (aX_n^{-1} + c) \bmod 2^e$, 
$X_0 = 1$, $e \ge 3$, has period length $2^{e-1}$ whenever a mod 4 = 1 and c mod 4 = 2. 

37. [HM32] Let p be prime and assume that $X_{n+1} = (aX_n^{-1} + c) \bmod p$ defines an
inversive congruential sequence of period p + 1. Also let $0 \le b_1 < \cdots < b_d \le p$, and
consider the set


$$V = \{(X_{n+b_1}, X_{n+b_2}, \ldots, X_{n+b_d}) \mid 0 \le n \le p \text{ and } X_{n+b_j} \ne \infty \text{ for } 1 \le j \le d\}.$$


This set contains p + 1 - d vectors, any d of which lie in some (d - 1)-dimensional
hyperplane $H = \{(v_1, \ldots, v_d) \mid r_1v_1 + \cdots + r_dv_d \equiv r_0 \pmod p\}$, where
$(r_1, \ldots, r_d) \not\equiv (0, \ldots, 0)$. Prove that no d + 1 vectors of V lie in the same hyperplane.
3.3. STATISTICAL TESTS


OUR MAIN PURPOSE is to obtain sequences that behave as if they are random. So 
far we have seen how to make the period of a sequence so long that for practical 
purposes it never will repeat; this is an important criterion, but it by no means 
guarantees that the sequence will be useful in applications. How then are we to 
decide whether a sequence is sufficiently random? 

If we were to give some randomly chosen man a pencil and paper and ask him 
to write down 100 random decimal digits, chances are very slim that he would 
produce a satisfactory result. People tend to avoid things that seem nonrandom, 
such as pairs of equal adjacent digits (although about one out of every 10 digits 
should equal its predecessor). And if we would show that same man a table of 
truly random digits, he would quite probably tell us they are not random at all; 
his eye would spot certain apparent regularities. 

According to Dr. I. J. Matrix and Donald C. Rehkopf (as quoted by Martin
Gardner in Scientific American, January, 1965), “Mathematicians consider the
decimal expansion of π a random series, but to a modern numerologist it is rich
with remarkable patterns.” Dr. Matrix has pointed out, for example, that the
first repeated two-digit number in π’s expansion is 26, and its second appearance
comes in the middle of a curious repetition pattern:


3.14159265358979323846264338327950 (1)


After listing a dozen or so further properties of these digits, he observed that π,
when correctly interpreted, conveys the entire history of the human race!

We all notice patterns in our telephone numbers, license numbers, etc., as 
aids to memory. The point of these remarks is that we cannot be trusted to judge 
by ourselves whether a sequence of numbers is random or not. Some unbiased 
mechanical tests must be applied. 

The theory of statistics provides us with some quantitative measures for 
randomness. There is literally no end to the number of tests that can be 
conceived; we will discuss the tests that have proved to be most useful, most 
instructive, and most readily adapted to computer calculation. 

If a sequence behaves randomly with respect to tests $T_1$, $T_2$, ..., $T_n$, we
cannot be sure in general that it will not be a miserable failure when it is
subjected to a further test $T_{n+1}$. Yet each test gives us more and more confidence
in the randomness of the sequence. In practice, we apply about half a dozen
different kinds of statistical tests to a sequence, and if it passes them satisfactorily
we consider it to be random; it is then presumed innocent until proven guilty.

Every sequence that is to be used extensively should be tested carefully, so 
the following sections explain how to administer the tests in an appropriate way. 
Two kinds of tests are distinguished: empirical tests, for which the computer 
manipulates groups of numbers of the sequence and evaluates certain statistics; 
and theoretical tests, for which we establish characteristics of the sequence by 
using number-theoretic methods based on the recurrence rule used to form the
sequence. 

If the evidence doesn’t come out as desired, the reader may wish to try the 
techniques in How to Lie With Statistics by Darrell Huff (Norton, 1954). 


3.3.1. General Test Procedures for Studying Random Data 


A. “Chi-square” tests. The chi-square test ($\chi^2$ test) is perhaps the best
known of all statistical tests, and it is a basic method that is used in connection
with many other tests. Before considering the idea in general, let us consider a
particular example of the chi-square test as it might be applied to dice throwing.
Using two “true” dice (each of which, independently, is assumed to yield the
values 1, 2, 3, 4, 5, or 6 with equal probability), the following table gives the
probability of obtaining a given total, s, on a single throw:


value of s =           2     3     4     5     6     7     8     9     10    11    12
probability, $p_s$ =   1/36  1/18  1/12  1/9   5/36  1/6   5/36  1/9   1/12  1/18  1/36    (1)


For example, a value of 4 can be thrown in three ways: 1 + 3, 2 + 2, 3 + 1; this
constitutes 3/36 = 1/12 = $p_4$ of the 36 possible outcomes.

If we throw the dice n times, we should obtain the value s approximately
$np_s$ times on the average. For example, in 144 throws we should get the value 4
about 12 times. The following table shows what results were actually obtained
in a particular sequence of 144 throws of the dice:


value of s =                2    3    4    5    6    7    8    9   10   11   12
observed number, $Y_s$ =    2    4   10   12   22   29   21   15   14    9    6    (2)
expected number, $np_s$ =   4    8   12   16   20   24   20   16   12    8    4


Notice that the observed number was different from the expected number in all 
cases; in fact, random throws of the dice will hardly ever come out with exactly 
the right frequencies. There are $36^{144}$ possible sequences of 144 throws, all of
which are equally likely. One of these sequences consists of all 2s (“snake eyes”), 
and anyone throwing 144 snake eyes in a row would be convinced that the dice 
were loaded. Yet the sequence of all 2s is just as probable as any other particular 
sequence if we specify the outcome of each throw of each die. 

In view of this, how can we test whether or not a given pair of dice is loaded? 
The answer is that we can’t make a definite yes-no statement, but we can give 
a probabilistic answer. We can say how probable or improbable certain types of 
events are. 

A fairly natural way to proceed in the example above is to consider the
squares of the differences between the observed numbers $Y_s$ and the expected
numbers $np_s$. We can add these together, obtaining


$$V = (Y_2 - np_2)^2 + (Y_3 - np_3)^2 + \cdots + (Y_{12} - np_{12})^2. \tag{3}$$


A bad set of dice should result in a relatively high value of V; and for any given
value of V we can ask, “What is the probability that V is this high, using true
dice?” If this probability is very small, say 1/100, we would know that only about
one time in 100 would true dice give results so far away from the expected numbers,
and we would have definite grounds for suspicion. (Remember, however,
that even good dice would give such a high value of V about one time in a
hundred, so a cautious person would repeat the experiment to see if the high
value of V is repeated.)

The statistic V in (3) gives equal weight to $(Y_7 - np_7)^2$ and $(Y_2 - np_2)^2$,
although $(Y_7 - np_7)^2$ is likely to be a good deal higher than $(Y_2 - np_2)^2$ since 7s
occur about six times as often as 2s. It turns out that the “right” statistic, at
least one that has proved to be most important, will give $(Y_7 - np_7)^2$ only 1/6 as
much weight as $(Y_2 - np_2)^2$, and we should change (3) to the following formula:


$$V = \frac{(Y_2 - np_2)^2}{np_2} + \frac{(Y_3 - np_3)^2}{np_3} + \cdots + \frac{(Y_{12} - np_{12})^2}{np_{12}}. \tag{4}$$


This is called the “chi-square” statistic of the observed quantities $Y_2$, ..., $Y_{12}$ in
the dice-throwing experiment. For the data in (2), we find that


$$V = \frac{(2-4)^2}{4} + \frac{(4-8)^2}{8} + \cdots + \frac{(9-8)^2}{8} + \frac{(6-4)^2}{4} = 7\tfrac{7}{48}. \tag{5}$$


The important question now is, of course, “Does $7\tfrac{7}{48}$ constitute an improbably
high value for V to assume?” Before answering this question, let us consider the 
general application of the chi-square method. 
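As a concrete check of Eq. (5), the following Python sketch recomputes the chi-square statistic of Eq. (4) from the data in (2). The function name `chi_square` and the use of exact rational arithmetic are illustrative choices, not part of the original text.

```python
from fractions import Fraction

def chi_square(observed, probs):
    """Chi-square statistic: sum of (Y_s - n*p_s)^2 / (n*p_s) over all categories."""
    n = sum(observed)
    return sum((y - n * p) ** 2 / (n * p) for y, p in zip(observed, probs))

# Probabilities (1) for the totals s = 2, 3, ..., 12 of two true dice.
dice_probs = [Fraction(6 - abs(s - 7), 36) for s in range(2, 13)]

# Observed counts Y_s from table (2), 144 throws in all.
observed = [2, 4, 10, 12, 22, 29, 21, 15, 14, 9, 6]

V = chi_square(observed, dice_probs)
print(V, float(V))  # 343/48, i.e. 7 7/48, about 7.146
```

Exact fractions are used here only so that the result can be compared digit for digit with the value in (5); floating-point arithmetic would serve equally well in practice.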

In general, suppose that every observation can fall into one of k categories.
We take n independent observations; this means that the outcome of one observation
has absolutely no effect on the outcome of any of the others. Let $p_s$ be the
probability that each observation falls into category s, and let $Y_s$ be the number
of observations that actually do fall into category s. We form the statistic


$$V = \sum_{s=1}^{k} \frac{(Y_s - np_s)^2}{np_s}. \tag{6}$$


In our example above, there are eleven possible outcomes of each throw of the 
dice, so k = 11. (Eq. (6) is a slight change of notation from Eq. (4), since we 
are numbering the possibilities from 1 to k instead of from 2 to 12.) 

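By expanding $(Y_s - np_s)^2 = Y_s^2 - 2np_sY_s + n^2p_s^2$ in (6), and using the facts
that


$$\begin{aligned} Y_1 + Y_2 + \cdots + Y_k &= n, \\ p_1 + p_2 + \cdots + p_k &= 1, \end{aligned} \tag{7}$$


we arrive at the formula


$$V = \frac{1}{n} \sum_{s=1}^{k} \left( \frac{Y_s^2}{p_s} \right) - n, \tag{8}$$


which often makes the computation of V somewhat easier.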
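For readers who want the intermediate steps, the expansion can be written out as follows. This is only a worked restatement of the derivation just described, using the definition (6) and the two facts in (7).

```latex
\begin{aligned}
V &= \sum_{s=1}^{k} \frac{(Y_s - np_s)^2}{np_s}
   = \sum_{s=1}^{k} \frac{Y_s^2 - 2np_sY_s + n^2p_s^2}{np_s} \\
  &= \frac{1}{n}\sum_{s=1}^{k} \frac{Y_s^2}{p_s}
     \;-\; 2\sum_{s=1}^{k} Y_s \;+\; n\sum_{s=1}^{k} p_s \\
  &= \frac{1}{n}\sum_{s=1}^{k} \frac{Y_s^2}{p_s} - 2n + n
   = \frac{1}{n}\sum_{s=1}^{k} \frac{Y_s^2}{p_s} - n .
\end{aligned}
```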
Table 1
SELECTED PERCENTAGE POINTS OF THE CHI-SQUARE DISTRIBUTION

           p = 1%     p = 5%     p = 25%    p = 50%    p = 75%    p = 95%    p = 99%
ν = 1      0.00016    0.00393    0.1015     0.4549     1.323      3.841      6.635
ν = 2      0.02010    0.1026     0.5754     1.386      2.773      5.991      9.210
ν = 3      0.1148     0.3518     1.213      2.366      4.108      7.815      11.34
ν = 4      0.2971     0.7107     1.923      3.357      5.385      9.488      13.28
ν = 5      0.5543     1.1455     2.675      4.351      6.626      11.07      15.09
ν = 6      0.8721     1.635      3.455      5.348      7.841      12.59      16.81
ν = 7      1.239      2.167      4.255      6.346      9.037      14.07      18.48
ν = 8      1.646      2.733      5.071      7.344      10.22      15.51      20.09
ν = 9      2.088      3.325      5.899      8.343      11.39      16.92      21.67
ν = 10     2.558      3.940      6.737      9.342      12.55      18.31      23.21
ν = 11     3.053      4.575      7.584      10.34      13.70      19.68      24.72
ν = 12     3.571      5.226      8.438      11.34      14.85      21.03      26.22
ν = 15     5.229      7.261      11.04      14.34      18.25      25.00      30.58
ν = 20     8.260      10.85      15.45      19.34      23.83      31.41      37.57
ν = 30     14.95      18.49      24.48      29.34      34.80      43.77      50.89
ν = 50     29.71      34.76      42.94      49.33      56.33      67.50      76.15
ν > 30     $\nu + \sqrt{2\nu}\,x_p + \frac{2}{3}x_p^2 - \frac{2}{3} + O(1/\sqrt{\nu})$
$x_p$ =    -2.33      -1.64      -.674      0.00       0.674      1.64       2.33


(For further values, see Handbook of Mathematical Functions, edited by M. Abramowitz and
I. A. Stegun (Washington, D.C.: U.S. Government Printing Office, 1964), Table 26.8. See also
Eq. (22) and exercise 16.)


Now we turn to the important question, “What constitutes a reasonable
value of V?” This is found by referring to a table such as Table 1, which gives
values of “the chi-square distribution with ν degrees of freedom” for various values
of ν. The line of the table with $\nu = k - 1$ is to be used; the number of “degrees of
freedom” is $k - 1$, one less than the number of categories. (Intuitively, this means
that $Y_1$, $Y_2$, ..., $Y_k$ are not completely independent, since Eq. (7) shows that $Y_k$
can be computed if $Y_1$, ..., $Y_{k-1}$ are known; hence, $k - 1$ degrees of freedom are
present. This argument is not rigorous, but the theory below justifies it.)

If the table entry in row ν under column p is x, it means, “The quantity V
in Eq. (8) will be less than or equal to x with approximate probability p, if n
is large enough.” For example, the 95 percent entry in row 10 is 18.31; we will
have V > 18.31 only about 5 percent of the time.
Let us assume that our dice-throwing experiment has been simulated on a
computer using some sequence of supposedly random numbers, with the following
results:

value of s =              2    3    4    5    6    7    8    9   10   11   12
Experiment 1, $Y_s$ =     4   10   10   13   20   18   18   11   13   14   13    (9)
Experiment 2, $Y_s$ =     3    7   11   15   19   24   21   17   13    9    5


We can compute the chi-square statistic in the first case, getting the value
$V_1 = 29\tfrac{59}{120}$, and in the second case we get $V_2 = 1\tfrac{17}{120}$. Referring to the table entries
for 10 degrees of freedom, we see that $V_1$ is much too high; V will be greater than
23.21 only about one percent of the time! (By using more extensive tables, we
find in fact that V will be as high as $V_1$ only 0.1 percent of the time.) Therefore
Experiment 1 represents a significant departure from random behavior.

On the other hand, $V_2$ is quite low, since the observed values $Y_s$ in Experiment 2
are quite close to the expected values $np_s$ in (2). The chi-square table
tells us, in fact, that $V_2$ is much too low: The observed values are so close to the
expected values, we cannot consider the result to be random! (Indeed, reference
to other tables shows that such a low value of V occurs only 0.03 percent of
the time when there are 10 degrees of freedom.) Finally, the value $V = 7\tfrac{7}{48}$
computed in (5) can also be checked with Table 1. It falls between the entries
for 25 percent and 50 percent, so we cannot consider it to be significantly high
or significantly low; thus the observations in (2) are satisfactorily random with
respect to this test.
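The table lookup just described can be sketched in Python. The row below is copied from Table 1 for 10 degrees of freedom, and the three data sets are those of (9) and (2); the helper names `chi_square` and `bracket`, and the wording of the returned strings, are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left
from fractions import Fraction

# Row for 10 degrees of freedom in Table 1 (p = 1%, 5%, 25%, 50%, 75%, 95%, 99%).
LEVELS = [1, 5, 25, 50, 75, 95, 99]
ROW_10 = [2.558, 3.940, 6.737, 9.342, 12.55, 18.31, 23.21]

def chi_square(observed, probs):
    """Chi-square statistic, Eq. (6)."""
    n = sum(observed)
    return sum((y - n * p) ** 2 / (n * p) for y, p in zip(observed, probs))

def bracket(v, row=ROW_10, levels=LEVELS):
    """Name the pair of tabulated percent levels that enclose v."""
    i = bisect_left(row, float(v))
    if i == 0:
        return f"below {levels[0]}%"
    if i == len(row):
        return f"above {levels[-1]}%"
    return f"between {levels[i - 1]}% and {levels[i]}%"

probs = [Fraction(6 - abs(s - 7), 36) for s in range(2, 13)]
experiments = {
    "Experiment 1": [4, 10, 10, 13, 20, 18, 18, 11, 13, 14, 13],
    "Experiment 2": [3, 7, 11, 15, 19, 24, 21, 17, 13, 9, 5],
    "Data (2)":     [2, 4, 10, 12, 22, 29, 21, 15, 14, 9, 6],
}
results = {name: chi_square(obs, probs) for name, obs in experiments.items()}
for name, v in results.items():
    print(name, v, bracket(v))
```

Only the seven tabulated levels are distinguished here; finer statements such as “0.1 percent” require more extensive tables or a routine for the chi-square distribution itself.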

It is somewhat remarkable that the same table entries are used no matter
what the value of n is, and no matter what the probabilities $p_s$ are. Only the
number $\nu = k - 1$ affects the results. In actual fact, however, the table entries
are not exactly correct: The chi-square distribution is an approximation that is
valid only for large enough values of n. How large should n be? A common rule
of thumb is to take n large enough so that each of the expected values $np_s$ is
five or more; preferably, however, take n much larger than this, to get a more
powerful test. In our examples above we took n = 144, so $np_2$ was only 4,
violating the stated rule of thumb. This was done only because the author
tired of throwing the dice; it makes the entries in Table 1 less accurate for our
application. Experiments run on a computer, with n = 1000, or 10000, or even
100000, would be much better than this. We could also combine the data for
s = 2 and s = 12; then the test would have only nine degrees of freedom but the
chi-square approximation would be more accurate.
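The rule of thumb is easy to check mechanically. The sketch below computes the smallest n for which every expected count is at least 5, first for the eleven dice totals and then with the categories s = 2 and s = 12 combined; the helper name `min_n_for_rule` is an assumption made for this illustration.

```python
from fractions import Fraction
from math import ceil

def min_n_for_rule(probs, threshold=5):
    """Smallest n such that n * p_s >= threshold for every category s."""
    return ceil(threshold / min(probs))

# Probabilities for the dice totals s = 2, 3, ..., 12.
probs = [Fraction(6 - abs(s - 7), 36) for s in range(2, 13)]

# Combine the two rarest totals, s = 2 and s = 12, into a single category.
merged = [probs[0] + probs[-1]] + probs[1:-1]

print(min_n_for_rule(probs), len(probs) - 1)    # 180 throws, 10 degrees of freedom
print(min_n_for_rule(merged), len(merged) - 1)  # 90 throws, 9 degrees of freedom
```

Under these assumptions the 144 throws of the example fall short of the rule for eleven categories but satisfy it once the two extreme totals are merged.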

We can get an idea of how crude an approximation is involved by considering
the case when there are only two categories, having probabilities $p_1$ and $p_2$.
Suppose $p_1 = \frac{1}{4}$ and $p_2 = \frac{3}{4}$. According to the stated rule of thumb, we should
have $n \ge 20$ to have a satisfactory approximation, so let’s check that out. When
n = 20, the possible values of V are $(Y_1 - 5)^2/5 + (5 - Y_1)^2/15 = \frac{4}{15}r^2$ for
$-5 \le r \le 15$, where $r = Y_1 - 5$; we wish to know how well the row $\nu = 1$ of Table 1
describes the distribution of V. The chi-square distribution varies continuously, while the
actual distribution of V has rather big jumps, so we need some convention for
representing the exact distribution. If the distinct possible outcomes of the
experiment lead to the values $V_0 \le V_1 \le \cdots \le V_n$ with respective
probabilities $\pi_0$, $\pi_1$, ..., $\pi_n$, suppose that a given percentage p falls in the range
$\pi_0 + \cdots + \pi_{j-1} < p < \pi_0 + \cdots + \pi_{j-1} + \pi_j$. We would like to represent p by a
“percentage point” x such that V is less than x with probability $\le p$ and V is
greater than x with probability $\le 1 - p$. It is not difficult to see that the only such
number is $x = V_j$. In our example for n = 20 and $\nu = 1$, it turns out that the
percentage points of the exact distribution, corresponding to the approximations
in Table 1 for p = 1%, 5%, 25%, 50%, 75%, 95%, and 99%, respectively, are


0, 0, .27, .27, 1.07, 4.27, 6.67 


(to two decimal places). For example, the percentage point for p = 95% is 4.27,
while Table 1 gives the estimate 3.841. The latter value is too low; it tells us
(incorrectly) to reject the value V = 4.27 at the 95% level, while in fact the
probability that $V \ge 4.27$ is more than 6.5%. When n = 21, the situation
changes slightly because the expected values $np_1 = 5.25$ and $np_2 = 15.75$ can
never be obtained exactly; the percentage points for n = 21 are


.02, .02, .14, .40, 1.29, 3.57, 5.73.


We would expect Table 1 to be a better approximation when n = 50, but
the corresponding tableau actually turns out to be further from Table 1 in some
respects than it was for n = 20:


.03, .03, .03, .67, 1.31, 3.23, 6.


Here are the values when n = 300:


0, 0, .07, .44, 1.44, 4, 6.42.


Even in this case, when $np_s$ is $\ge 75$ in each category, the entries in Table 1 are
good to only about one significant digit.
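The exact percentage points quoted above can be recomputed directly from the binomial distribution. The following Python sketch follows the convention described in the text: sort the possible values of V, accumulate their probabilities, and report the value at which the running total first exceeds p. The function name `exact_percentage_points` is an assumption, and exact rational arithmetic is used only to avoid rounding questions at the boundaries.

```python
from fractions import Fraction
from math import comb

def exact_percentage_points(n, p1, levels):
    """Percentage points of the exact distribution of V for two categories.

    Y_1 is binomial(n, p1), and V = (Y_1 - n*p1)^2 / (n*p1*(1 - p1)),
    which equals the two-term chi-square sum because Y_2 = n - Y_1.
    """
    p2 = 1 - p1
    outcomes = []
    for y in range(n + 1):
        prob = comb(n, y) * p1**y * p2**(n - y)
        v = (y - n * p1) ** 2 / (n * p1 * p2)
        outcomes.append((v, prob))
    outcomes.sort()                      # ascending order of V

    points = []
    for p in levels:
        total = Fraction(0)
        for v, prob in outcomes:
            total += prob
            if total > p:                # cumulative probability first exceeds p
                points.append(v)
                break
    return points

levels = [Fraction(k, 100) for k in (1, 5, 25, 50, 75, 95, 99)]
for n in (20, 21, 50, 300):
    pts = exact_percentage_points(n, Fraction(1, 4), levels)
    print(n, [f"{float(v):.2f}" for v in pts])
```

For n = 20 and n = 21 this reproduces the two rows of values listed in the text, printed with a leading zero.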

The proper choice of n is somewhat obscure. If the dice are actually biased, 
the fact will be detected as n gets larger and larger. (See exercise 12.) But large 
values of n will tend to smooth out locally nonrandom behavior, when blocks of 
numbers with a strong bias are followed by blocks of numbers with the opposite 
bias. Locally nonrandom behavior is not an issue when actual dice are rolled, 
since the same dice are used throughout the test, but a sequence of numbers 
generated by computer might very well display such anomalies. Perhaps a chi- 
square test should be made for several different values of n. At any rate, n should 
always be rather large. 


We can summarize the chi-square test as follows. A fairly large number, n, of
independent observations is made. (It is important to avoid using the chi-square
method unless the observations are independent. See, for example, exercise 10,
which considers the case when half of the observations depend on the other
half.) We count the number of observations falling into each of k categories and
compute the quantity V given in Eqs. (6) and (8). Then V is compared with the
numbers in Table 1, with $\nu = k - 1$. If V is less than the 1% entry or greater
than the 99% entry, we reject the numbers as not sufficiently random. If V lies
between the 1% and 5% entries or between the 95% and 99% entries, the numbers
are “suspect”; if (by interpolation in the table) V lies between the 5% and 10%
entries, or the 90% and 95% entries, the numbers might be “almost suspect.”
The chi-square test is often done at least three times on different sets of data,
and if at least two of the three results are suspect the numbers are regarded as
not sufficiently random.


[Figure 2: grid of test indications for Generators A through F, five types of test applied to three blocks each.]

Range of V                      Indication
0-1 percent, 99-100 percent     Reject
1-5 percent, 95-99 percent      Suspect
5-10 percent, 90-95 percent     Almost suspect

Fig. 2. Indications of “significant” deviations in 90 chi-square tests (see also Fig. 5).

For example, see Fig. 2, which shows schematically the results of applying
five different types of chi-square tests on each of six sequences of random
numbers. Each test in this illustration was applied to three different blocks 
of numbers of the sequence. Generator A is the MacLaren-Marsaglia method 
(Algorithm 3.2.2M applied to the sequences in 3.2.2-(13)); Generator E is the 
Fibonacci method, 3.2.2-(5); and the other generators are linear congruential 
sequences with the following parameters: 


Generator B: $X_0 = 0$, $a = 3141592653$, $c = 2718281829$, $m = 2^{35}$.
Generator C: $X_0 = 0$, $a = 2^7 + 1$, $c = 1$, $m = 2^{35}$.
Generator D: $X_0 = 47594118$, $a = 23$, $c = 0$, $m = 10^8 + 1$.
Generator F: $X_0 = 314159265$, $a = 2^{18} + 1$, $c = 1$, $m = 2^{35}$.


From Fig. 2 we conclude that (so far as these tests are concerned) Generators A, 
B, D are satisfactory, Generator C is on the borderline and should probably 
be rejected, Generators E and F are definitely unsatisfactory. Generator F 
has, of course, low potency; Generators C and D have been discussed in the 
literature, but their multipliers are too small. (Generator D is the original 
multiplicative generator proposed by Lehmer in 1948; Generator C is the original 
linear congruential generator with $c \ne 0$ proposed by Rotenberg in 1960.)
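To make the parameter lists concrete, here is a minimal Python sketch of Generators B, D, and F as linear congruential sequences of the form $X_{n+1} = (aX_n + c) \bmod m$, followed by a simple ten-category chi-square check. The ten-category equidistribution check, the sample size, and all function names are assumptions made for illustration; the check is not claimed to be one of the five kinds of test summarized in Fig. 2.

```python
from itertools import islice

def lcg(x0, a, c, m):
    """Yield X_1, X_2, ... of the sequence X_{n+1} = (a*X_n + c) mod m."""
    x = x0
    while True:
        x = (a * x + c) % m
        yield x

GENERATORS = {
    "B": dict(x0=0, a=3141592653, c=2718281829, m=2**35),
    "D": dict(x0=47594118, a=23, c=0, m=10**8 + 1),
    "F": dict(x0=314159265, a=2**18 + 1, c=1, m=2**35),
}

def equidistribution_chi_square(name, n=10000, k=10):
    """Chi-square statistic for n values sorted into k equal-width categories.

    The categories are treated as equally likely, which is only approximately
    true because m is not a multiple of k.
    """
    params = GENERATORS[name]
    counts = [0] * k
    for x in islice(lcg(**params), n):
        counts[x * k // params["m"]] += 1
    expected = n / k
    return sum((y - expected) ** 2 / expected for y in counts)

for name in GENERATORS:
    print(name, round(equidistribution_chi_square(name), 3))
```

Each printed value would be compared with the row for nine degrees of freedom in Table 1, preferably on several different blocks of each sequence.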

Instead of using the “suspect,” “almost suspect,” etc., criteria for judging 
the results of chi-square tests, one can employ a less ad hoc procedure discussed 
later in this section. 
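[Figure 3: (a) distribution function of a random bit; (b) distribution function of a uniform random real number between 0 and 1; (c) chi-square distribution with 10 degrees of freedom, with x = 3.9, 6.7, 9.3, 12.6, and 18.3 marked on the horizontal axis.]

Fig. 3. Examples of distribution functions.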


B. The Kolmogorov-Smirnov test. As we have seen, the chi-square test
applies to the situation when observations can fall into a finite number of categories.
It is not unusual, however, to consider random quantities that range over
infinitely many values, such as a random fraction (a random real number between 
0 and 1). Even though only finitely many real numbers can be represented in a 
computer, we want our random values to behave essentially as if all real numbers 
in [0..1) were equally likely. 

A general notation for specifying probability distributions, whether they 
are finite or infinite, is commonly used in the study of probability and statistics. 
Suppose we want to specify the distribution of the values of a random quantity, X; 
we do this in terms of the distribution function F(x), where 


$$F(x) = \Pr(X \le x) = \text{probability that } (X \le x).$$


Three examples are shown in Fig. 3. First we see the distribution function for a
random bit, namely for the case when X takes on only the two values 0 and 1,
each with probability 1/2. Part (b) of the figure shows the distribution function
for a uniformly distributed random real number between zero and one; here the
probability that $X \le x$ is simply equal to x when $0 \le x \le 1$. For example,
the probability that $X \le \frac{2}{3}$ is, naturally, 2/3. And part (c) shows the limiting
distribution of the value V in the chi-square test (shown here with 10 degrees of
freedom); this is a distribution that we have already seen represented in another
way in Table 1. Notice that F(x) always increases from 0 to 1 as x increases
from $-\infty$ to $+\infty$.
If we make n independent observations of the random quantity X, thereby
obtaining the values $X_1$, $X_2$, ..., $X_n$, we can form the empirical distribution
function $F_n(x)$, where


$$F_n(x) = \frac{\text{number of } X_1, X_2, \ldots, X_n \text{ that are} \le x}{n}. \tag{10}$$


Figure 4 illustrates three empirical distribution functions (shown as zigzag lines,
although strictly speaking the vertical lines are not part of the graph of $F_n(x)$),
superimposed on a graph of the assumed actual distribution function F(x). As
n gets large, $F_n(x)$ should be a better and better approximation to F(x).
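Definition (10) translates almost directly into code. The sketch below builds the empirical distribution function from a list of observations; the name `empirical_distribution` and the four sample values are hypothetical.

```python
from bisect import bisect_right

def empirical_distribution(observations):
    """Return F_n, where F_n(x) = (number of observations <= x) / n, as in (10)."""
    xs = sorted(observations)
    n = len(xs)

    def F_n(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(xs, x) / n

    return F_n

F4 = empirical_distribution([0.7, 0.1, 0.4, 0.4])
print(F4(0.05), F4(0.1), F4(0.4), F4(0.69), F4(1.0))  # 0.0 0.25 0.75 0.75 1.0
```

The function is a step function that jumps by 1/n at each observation (by 2/n where two observations coincide, as at 0.4 here).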


[Figure 4: three panels (a), (b), (c), each with the horizontal axis marked at the 5%, 25%, 50%, 75%, 95%, and 99% percentage points.]

Fig. 4. Examples of empirical distributions. The x value marked “5%” is the
percentage point where F(x) = 0.05.


The Kolmogorov-Smirnov test (KS test) may be used when F(x) has no
jumps. It is based on the difference between F(x) and $F_n(x)$. A bad source of
random numbers will give empirical distribution functions that do not approximate
F(x) sufficiently well. Figure 4(b) shows an example in which the $X_j$ are
consistently too high, so the empirical distribution function is too low. Part (c)
of the figure shows an even worse example; it is plain that such great deviations
between $F_n(x)$ and F(x) are extremely improbable, and the KS test is used to
tell us how improbable they are.
To make the KS test, we form the following statistics:


$$\begin{aligned}
K_n^+ &= \sqrt{n} \sup_{-\infty < x < +\infty} \bigl(F_n(x) - F(x)\bigr); \\
K_n^- &= \sqrt{n} \sup_{-\infty < x < +\infty} \bigl(F(x) - F_n(x)\bigr).
\end{aligned} \tag{11}$$


Here $K_n^+$ measures the greatest amount of deviation when $F_n$ is greater than F,
and $K_n^-$ measures the maximum deviation when $F_n$ is less than F. The statistics
for the examples of Fig. 4 are


               Fig. 4(a)    Fig. 4(b)    Fig. 4(c)
$K_{20}^+$       0.492        0.134        0.313       (12)
$K_{20}^-$       0.536        1.027        2.101


(Note: The factor $\sqrt{n}$ that appears in Eqs. (11) may seem puzzling at first.
Exercise 6 shows that, for fixed x, the standard deviation of $F_n(x)$ is proportional
to $1/\sqrt{n}$; hence the factor $\sqrt{n}$ magnifies the statistics $K_n^+$ and $K_n^-$ in such a way
that this standard deviation is independent of n.)

As in the chi-square test, we may now look up the values $K_n^+$, $K_n^-$ in a
percentile table to determine if they are significantly high or low. Table 2 may
be used for this purpose, both for $K_n^+$ and $K_n^-$. For example, the probability is
75 percent that $K_{20}^-$ will be 0.7975 or less. Unlike the chi-square test, the table
entries are not merely approximations that hold for large values of n; Table 2
gives exact values (except, of course, for roundoff error), and the KS test may
be used reliably for any value of n.

As they stand, formulas (11) are not readily adapted to computer calculation,
since we are asking for a least upper bound over infinitely many values of x.
But from the fact that F(x) is increasing and the fact that $F_n(x)$ increases only
in finite steps, we can derive a simple procedure for evaluating the statistics $K_n^+$
and $K_n^-$:

Step 1. Obtain independent observations $X_1$, $X_2$, ..., $X_n$.


Step 2. Rearrange the observations so that they are sorted into ascending order,
$X_1 \le X_2 \le \cdots \le X_n$. (Efficient sorting algorithms are the subject of Chapter 5.
But it is possible to avoid sorting in this case, as shown in exercise 23.)


Step 3. The desired statistics are now given by the formulas


$$\begin{aligned}
K_n^+ &= \sqrt{n} \max_{1 \le j \le n} \left( \frac{j}{n} - F(X_j) \right); \\
K_n^- &= \sqrt{n} \max_{1 \le j \le n} \left( F(X_j) - \frac{j-1}{n} \right).
\end{aligned} \tag{13}$$
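Steps 1 to 3 can be sketched in a few lines of Python. The function name `ks_statistics` and the three-element sample are hypothetical; the sample is tested against the uniform distribution F(x) = x purely as an illustration.

```python
from math import sqrt

def ks_statistics(observations, F):
    """Return (K_n_plus, K_n_minus) computed as in Steps 1-3 and Eq. (13)."""
    xs = sorted(observations)                    # Step 2: ascending order
    n = len(xs)
    # Step 3: maxima over j = 1, ..., n of the two one-sided deviations.
    k_plus = sqrt(n) * max(j / n - F(x) for j, x in enumerate(xs, start=1))
    k_minus = sqrt(n) * max(F(x) - (j - 1) / n for j, x in enumerate(xs, start=1))
    return k_plus, k_minus

# Hypothetical sample, compared with the uniform distribution on [0, 1].
sample = [0.7, 0.1, 0.4]
k_plus, k_minus = ks_statistics(sample, lambda x: x)
print(k_plus, k_minus)
```

The sort dominates the running time; the two maxima need only a single pass over the sorted values.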
Table 2
SELECTED PERCENTAGE POINTS OF THE DISTRIBUTIONS $K_n^+$ AND $K_n^-$

           p = 1%     p = 5%     p = 25%    p = 50%    p = 75%    p = 95%    p = 99%
n = 1      0.01000    0.05000    0.2500     0.5000     0.7500     0.9500     0.9900
n = 2      0.01400    0.06749    0.2929     0.5176     0.7071     1.0980     1.2728
n = 3      0.01699    0.07919    0.3112     0.5147     0.7539     1.1017     1.3589
n = 4      0.01943    0.08789    0.3202     0.5110     0.7642     1.1304     1.3777
n = 5      0.02152    0.09471    0.3249     0.5245     0.7674     1.1392     1.4024
n = 6      0.02336    0.1002     0.3272     0.5319     0.7703     1.1463     1.4144
n = 7      0.02501    0.1048     0.3280     0.5364     0.7755     1.1537     1.4246
n = 8      0.02650    0.1086     0.3280     0.5392     0.7797     1.1586     1.4327
n = 9      0.02786    0.1119     0.3274     0.5411     0.7825     1.1624     1.4388
n = 10     0.02912    0.1147     0.3297     0.5426     0.7845     1.1658     1.4440
n = 11     0.03028    0.1172     0.3330     0.5439     0.7863     1.1688     1.4484
n = 12     0.03137    0.1193     0.3357     0.5453     0.7880     1.1714     1.4521
n = 15     0.03424    0.1244     0.3412     0.5500     0.7926     1.1773     1.4606
n = 20     0.03807    0.1298     0.3461     0.5547     0.7975     1.1839     1.4698
n = 30     0.04354    0.1351     0.3509     0.5605     0.8036     1.1916     1.4801
n > 30     $y_p - \frac{1}{6}n^{-1/2} + O(1/n)$, where $y_p^2 = \frac{1}{2}\ln(1/(1 - p))$
$y_p$ =    0.07089    0.1601     0.3793     0.5887     0.8326     1.2239     1.5174


(To extend this table, see Eqs. (25) and (26), and the answer to exercise 20.)


An appropriate choice of the number of observations, n, is slightly easier to
make for this test than it is for the $\chi^2$ test, although some of the considerations
are similar. If the random variables $X_j$ actually belong to the probability
distribution G(x), while they were assumed to belong to the distribution given
by F(x), we want n to be comparatively large, in order to reject the hypothesis
that G(x) = F(x); for we need n large enough that the empirical distributions
$G_n(x)$ and $F_n(x)$ are expected to be observably different. On the other hand,
large values of n will tend to average out locally nonrandom behavior, and such
undesirable behavior is a significant danger in most computer applications of
random numbers; this makes a case for smaller values of n. A good compromise
would be to take n equal to, say, 1000, and to make a fairly large number of
calculations of $K_{1000}^+$ on different parts of a random sequence, thereby obtaining
values


$$K_{1000}^+(1), \quad K_{1000}^+(2), \quad \ldots, \quad K_{1000}^+(r). \tag{14}$$


We can also apply the KS test again to these results: Let F(x) now be the
distribution function for $K_{1000}^+$, and determine the empirical distribution $F_r(x)$
obtained from the observed values in (14). Fortunately, the function F(x) in this
case is very simple; for a large value of n like n = 1000, the distribution of $K_n^+$
is closely approximated by


$$F_\infty(x) = 1 - e^{-2x^2}, \qquad x \ge 0. \tag{15}$$
The same remarks apply to $K_n^-$, since $K_n^+$ and $K_n^-$ have the same expected
behavior. This method of using several tests for moderately large n, then
combining the observations later in another KS test, will tend to detect both
local and global nonrandom behavior.
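The limiting distribution (15) and the large-n rows of Table 2 are simple enough to evaluate directly. In the Python sketch below, `y_p` inverts the limiting distribution $1 - e^{-2x^2}$, and `approx_percentage_point` applies the first correction term from the “n > 30” row of Table 2; all three function names are assumptions made for illustration.

```python
from math import exp, log, sqrt

def F_inf(x):
    """Limiting distribution of K_n^+ (and K_n^-): 1 - exp(-2 x^2) for x >= 0."""
    return 1.0 - exp(-2.0 * x * x) if x >= 0 else 0.0

def y_p(p):
    """Solve F_inf(y) = p, that is, y^2 = (1/2) ln(1/(1 - p))."""
    return sqrt(0.5 * log(1.0 / (1.0 - p)))

def approx_percentage_point(p, n):
    """Large-n approximation y_p - 1/(6 sqrt(n)); the O(1/n) term is dropped."""
    return y_p(p) - 1.0 / (6.0 * sqrt(n))

for p in (0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99):
    print(f"p = {p:.2f}  y_p = {y_p(p):.4f}  "
          f"n = 1000: {approx_percentage_point(p, 1000):.4f}")
```

The computed values of `y_p` can be compared with the last row of Table 2; for n as small as 30 the dropped term is still visible in the third decimal place.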

For example, the author conducted the following simple experiment while
writing this chapter: The “maximum-of-5” test described in the next section was
applied to a set of 1000 uniform random numbers, yielding 200 observations $X_1$,
$X_2$, ..., $X_{200}$ that were supposed to belong to the distribution $F(x) = x^5$ for
$0 \le x \le 1$. The observations were divided into 20 groups of 10 each, and the
statistic $K_{10}^+$ was computed for each group. The 20 values of $K_{10}^+$ thus obtained
led to the empirical distributions shown in Fig. 4. The smooth curve shown in
each of the diagrams in Fig. 4 is the actual distribution the statistic $K_{10}^+$ should
have. Figure 4(a) shows the empirical distribution of $K_{10}^+$ obtained from the
sequence


$$Y_{n+1} = (3141592653\,Y_n + 2718281829) \bmod 2^{35}, \qquad U_n = Y_n/2^{35},$$


and it is satisfactorily random. Part (b) of the figure came from the Fibonacci
method; this sequence has globally nonrandom behavior, that is, it can be
shown that the observations $X_n$ in the maximum-of-5 test do not have the correct
distribution $F(x) = x^5$. Part (c) came from the notorious and impotent linear
congruential sequence $Y_{n+1} = ((2^{18} + 1)Y_n + 1) \bmod 2^{35}$, $U_n = Y_n/2^{35}$.

The KS test applied to the data in Fig. 4 gives the results shown in (12).
Referring to Table 2 for n = 20, we see that the values of $K_{20}^+$ and $K_{20}^-$ for
Fig. 4(b) are almost suspect (they lie at about the 5 percent and 88 percent
levels), but they are not quite bad enough to be rejected outright. The value of
$K_{20}^-$ for Fig. 4(c) is, of course, completely out of line, so the maximum-of-5 test
shows a definite failure of that random number generator.

We would expect the KS test in this experiment to have more difficulty
locating global nonrandomness than local nonrandomness, since the basic observations
in Fig. 4 were made on samples of only 10 numbers each. If we were
to take 20 groups of 1000 numbers each, part (b) would show a much more
significant deviation. To illustrate this point, a single KS test was applied to all
200 of the observations that led to Fig. 4, and the following results were obtained:


                Fig. 4(a)    Fig. 4(b)    Fig. 4(c)
$K_{200}^+$       0.477        1.537        2.819       (16)
$K_{200}^-$       0.817        0.194        0.058


The global nonrandomness of the Fibonacci generator has definitely been detected
here.


We may summarize the Kolmogorov-Smirnov test as follows. We are given
n independent observations $X_1$, ..., $X_n$ taken from some distribution specified
by a continuous function F(x). That is, F(x) must be like the functions shown
in Fig. 3(b) and 3(c), having no jumps like those in Fig. 3(a). The procedure
explained just before Eqs. (13) is carried out on these observations, and we obtain
the statistics $K_n^+$ and $K_n^-$. These statistics should be distributed according to
Table 2.

Some comparisons between the KS test and the $\chi^2$ test can now be made.
In the first place, we should observe that the KS test may be used in conjunction
with the $\chi^2$ test, to give a better procedure than the ad hoc method we mentioned
when summarizing the $\chi^2$ test. (That is, there is a better way to proceed than
to make three tests and to consider how many of the results were “suspect.”)
Suppose we have made, say, 10 independent $\chi^2$ tests on different parts of a
random sequence, so that values $V_1$, $V_2$, ..., $V_{10}$ have been obtained. It is not a
good policy simply to count how many of the V’s are suspiciously large or small.
This procedure will work in extreme cases, and very large or very small values
may mean that the sequence has too much local nonrandomness; but a better
general method would be to plot the empirical distribution of these 10 values and
to compare it to the correct distribution, which may be obtained from Table 1.
The empirical distribution gives a clearer picture of the results of the $\chi^2$ tests,
and in fact the statistics $K_{10}^+$ and $K_{10}^-$ could be determined from the empirical
$\chi^2$ values as an indication of success or failure. With only 10 values or even
as many as 100 this could all be done easily by hand, using graphical methods;
with a larger number of V’s, a computer subroutine for calculating the chi-square
distribution would be necessary. Notice that all 20 of the observations in Fig. 4(c)
fall between the 5 and 95 percent levels, so we would not have regarded any of
them as suspicious, individually; yet collectively the empirical distribution shows
that these observations are not at all right.

An important difference between the KS test and the chi-square test is that 
the KS test applies to distributions F(x) having no jumps, while the chi-square 
test applies to distributions having nothing but jumps (since all observations 
are divided into k categories). The two tests are thus intended for different 
sorts of applications. Yet it is possible to apply the χ² test even when F(x) is
continuous, if we divide the domain of F(x) into k parts and ignore all variations
within each part. For example, if we want to test whether or not U_1, U_2, ..., U_n
can be considered to come from the uniform distribution between zero and one,
we want to test if they have the distribution F(x) = x for 0 ≤ x ≤ 1. This is
a natural application for the KS test. But we might also divide up the interval 
from 0 to 1 into k = 100 equal parts, count how many U’s fall into each part, 
and apply the chi-square test with 99 degrees of freedom. There are not many 
theoretical results available at the present time to compare the effectiveness of 
the KS test versus the chi-square test. The author has found some examples in
which the KS test pointed out nonrandomness more clearly than the χ² test, and
others in which the χ² test gave a more significant result. If, for example, the
100 categories mentioned above are numbered 0, 1, ..., 99, and if the deviations
from the expected values are positive in compartments 0 to 49 but negative in
compartments 50 to 99, then the empirical distribution function will be much
further from F(x) than the χ² value would indicate; but if the positive deviations
occur in compartments 0, 2, ..., 98 and the negative ones occur in 1, 3, ..., 99,
the empirical distribution function will tend to hug F(x) much more closely. The
kinds of deviations measured are therefore somewhat different. A χ² test was
applied to the 200 observations that led to Fig. 4, with k = 10, and the respective
values of V were 9.4, 17.7, and 39.3; so in this particular case the values were
quite comparable to the KS values given in (16). Since the χ² test is intrinsically
less accurate, and since it requires comparatively large values of n, the KS test
has several advantages when a continuous distribution is to be tested.

Fig. 5. The KS tests applied to the same data as Fig. 2.

A further example will also be of interest. The data that led to Fig. 2 
were chi-square statistics based on n = 200 observations of the maximum-of-t 
criterion for 1 ≤ t ≤ 5, with the range divided into 10 equally probable parts.
KS statistics K_200^+ and K_200^- can be computed from exactly the same sets of 200
observations, and the results can be tabulated in just the same way as we did 
in Fig. 2 (showing which KS values are beyond the 99-percent level, etc.); the 
results in this case are shown in Fig. 5. Notice that Generator D (Lehmer’s 
original method) shows up very badly in Fig. 5, while chi-square tests on the 
very same data revealed no difficulty in Fig. 2; contrariwise, Generator E (the 
Fibonacci method) does not look so bad in Fig. 5. The good generators, A 
and B, passed all tests satisfactorily. The reasons for the discrepancies between 
Fig. 2 and Fig. 5 are primarily that (a) the number of observations, 200, is really 
not large enough for a powerful test, and (b) the “reject,” “suspect,” “almost 
suspect” ranking criterion is itself suspect. 

(Incidentally, it is not fair to blame Lehmer for using a “bad” random 
number generator in the 1940s, since his actual use of Generator D was quite 
valid. The ENIAC computer was a highly parallel machine, programmed by 
means of a plugboard; Lehmer set it up so that one of its accumulators was 
repeatedly multiplying its own contents by 23 (modulo 10^8 + 1), yielding a
new value every few milliseconds. Since this multiplier 23 is too small, we
know that each value obtained by such a process is too strongly related to
the preceding value to be considered sufficiently random; but the durations
of time between actual uses of the values in the special accumulator by the
accompanying program were comparatively long and subject to some fluctuation.
So the effective multiplier was 23^k for large, varying values of k.)


C. History, bibliography, and theory. The chi-square test was introduced by 
Karl Pearson in 1900 [Philosophical Magazine, Series 5, 50, 157-175]. Pearson’s 
important paper is regarded as one of the foundations of modern statistics, since 
before that time people would simply plot experimental results graphically and 
assert that they were correct. In his paper, Pearson gave several interesting 
examples of the previous misuse of statistics; and he also proved that certain 
runs at roulette (which he had experienced during two weeks at Monte Carlo in 
1892) were so far from the expected frequencies that odds against the assumption 
of an honest wheel were some 10^29 to one! A general discussion of the chi-square
test and an extensive bibliography appear in the survey article by William G. 
Cochran, Annals Math. Stat. 23 (1952), 315-345. 
Let us now consider a brief derivation of the theory behind the chi-square 
test. The exact probability that Y_1 = y_1, ..., Y_k = y_k is easily seen to be

    $$ \frac{n!}{y_1! \ldots y_k!}\, p_1^{y_1} \ldots p_k^{y_k}. \tag{17} $$

If we assume that Y_s has the value y_s with the Poisson probability

    $$ \frac{e^{-np_s} (np_s)^{y_s}}{y_s!}, $$

and that the Y’s are independent, then (Y_1, ..., Y_k) will equal (y_1, ..., y_k) with
probability

    $$ \prod_{s=1}^{k} \frac{e^{-np_s} (np_s)^{y_s}}{y_s!}, $$

and Y_1 + ··· + Y_k will equal n with probability

    $$ \sum_{\substack{y_1+\cdots+y_k=n \\ y_1,\ldots,y_k\ge 0}} \prod_{s=1}^{k} \frac{e^{-np_s} (np_s)^{y_s}}{y_s!} = \frac{e^{-n} n^n}{n!}. $$

If we assume that they are independent except for the condition Y_1 + ··· + Y_k = n,
the probability that (Y_1, ..., Y_k) = (y_1, ..., y_k) is the quotient

    $$ \Biggl( \prod_{s=1}^{k} \frac{e^{-np_s} (np_s)^{y_s}}{y_s!} \Biggr) \Bigg/ \Biggl( \frac{e^{-n} n^n}{n!} \Biggr), $$

which equals (17). We may therefore regard the Y’s as independently Poisson
distributed, except for the fact that they have a fixed sum.
It is convenient to make a change of variables,

    $$ Z_s = \frac{Y_s - np_s}{\sqrt{np_s}}, \tag{18} $$

so that V = Z_1^2 + ··· + Z_k^2. The condition Y_1 + ··· + Y_k = n is equivalent to
requiring that

    $$ \sqrt{p_1}\,Z_1 + \cdots + \sqrt{p_k}\,Z_k = 0. \tag{19} $$

Let us consider the (k − 1)-dimensional space S of all vectors (Z_1, ..., Z_k)
such that (19) holds. For large values of n, each Z_s has approximately the
normal distribution (see exercise 1.2.10-15); therefore points in a differential
volume dz_2 ... dz_k of S occur with probability approximately proportional to
exp(−(z_1^2 + ··· + z_k^2)/2). (It is at this point in the derivation that the chi-square
method becomes only an approximation for large n.) The probability that V ≤ v
is now

    $$ \frac{\int_{(z_1,\ldots,z_k)\text{ in }S\text{ and }z_1^2+\cdots+z_k^2\le v} \exp\bigl(-(z_1^2+\cdots+z_k^2)/2\bigr)\,dz_2\ldots dz_k}{\int_{(z_1,\ldots,z_k)\text{ in }S} \exp\bigl(-(z_1^2+\cdots+z_k^2)/2\bigr)\,dz_2\ldots dz_k}. \tag{20} $$

Since the hyperplane (19) passes through the origin of k-dimensional space, the
numerator in (20) is an integration over the interior of a (k − 1)-dimensional
hypersphere centered at the origin. An appropriate transformation to generalized
polar coordinates with radius χ and angles ω_1, ..., ω_{k−2} transforms (20) into

    $$ \frac{\int_{\chi^2\le v} e^{-\chi^2/2}\chi^{k-2} f(\omega_1,\ldots,\omega_{k-2})\,d\chi\,d\omega_1\ldots d\omega_{k-2}}{\int e^{-\chi^2/2}\chi^{k-2} f(\omega_1,\ldots,\omega_{k-2})\,d\chi\,d\omega_1\ldots d\omega_{k-2}} $$

for some function f (see exercise 15); then integration over the angles ω_1, ...,
ω_{k−2} gives a constant factor that cancels from numerator and denominator. We
finally obtain the formula

    $$ \frac{\int_0^{\sqrt v} e^{-\chi^2/2}\chi^{k-2}\,d\chi}{\int_0^{\infty} e^{-\chi^2/2}\chi^{k-2}\,d\chi} \tag{21} $$

for the approximate probability that V ≤ v.

Our derivation of (21) uses the symbol χ to stand for the radial length,
just as Pearson did in his original paper; this is how the χ² test got its name.
Substituting t = χ²/2, the integrals can be expressed in terms of the incomplete
gamma function, which we discussed in Section 1.2.11.3:

    $$ \lim_{n\to\infty} \Pr(V \le v) = \gamma\Bigl(\frac{k-1}{2}, \frac{v}{2}\Bigr) \Big/ \Gamma\Bigl(\frac{k-1}{2}\Bigr). \tag{22} $$

This is the definition of the chi-square distribution with k − 1 degrees of freedom.
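
The ratio of gamma functions that defines the chi-square distribution can be evaluated numerically for moderate arguments. The following Python sketch is not taken from the text; it assumes the standard power series γ(a, x)/Γ(a) = e^{−x} x^a Σ_{m≥0} x^m/Γ(a+m+1), and the name `chi_square_cdf` and the parameter `nu` (the number of degrees of freedom, k − 1 in the notation above) are hypothetical.

```python
import math

def chi_square_cdf(v, nu):
    """Approximate Pr(V <= v) for the chi-square distribution with nu degrees of freedom."""
    if v <= 0:
        return 0.0
    a, x = nu / 2.0, v / 2.0
    # First term of the series: e^{-x} x^a / Gamma(a + 1), computed via logarithms
    term = math.exp(-x + a * math.log(x) - math.lgamma(a + 1.0))
    total = term
    m = 0
    while term > 1e-17 * total:
        m += 1
        term *= x / (a + m)                # ratio of consecutive series terms
        total += term
    return min(total, 1.0)
```

The series converges for every x > 0, but it needs many terms when v is large, and the leading factor underflows for extremely large v; a production routine would switch to a continued fraction or an asymptotic formula in that range.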
We now turn to the KS test. In 1933, A. N. Kolmogorov proposed a test 
based on the statistic 


    $$ K_n = \sqrt{n}\,\max_{-\infty<x<+\infty} |F_n(x) - F(x)| = \max(K_n^+, K_n^-). \tag{23} $$

N. V. Smirnov discussed several modifications of this test in 1939, including the
individual examination of K_n^+ and K_n^- as we have suggested above. There is
a large family of similar tests, but the K_n^+ and K_n^- statistics seem to be most
convenient for computer application. A comprehensive review of the literature 
concerning KS tests and their generalizations, including an extensive bibliogra- 
phy, appears in a monograph by J. Durbin, Regional Conf. Series on Applied 
Math. 9 (SIAM, 1973). 

To study the distribution of K_n^+ and K_n^-, we begin with the following basic
fact: If X is a random variable with the continuous distribution F(x), then F(X)
is a uniformly distributed real number between 0 and 1. To prove this, we need
only verify that if 0 ≤ y ≤ 1 we have F(X) ≤ y with probability y. Since F is
continuous, F(x_0) = y for some x_0; thus the probability that F(X) ≤ y is the
probability that X ≤ x_0. By definition, the latter probability is F(x_0), that is,
it is y.

Let Y_j = nF(X_j), for 1 ≤ j ≤ n, where the X’s have been sorted as in
Step 2 preceding Eq. (13). Then the variables Y_j are essentially the same as
independent, uniformly distributed random numbers between 0 and n that have
been sorted into nondecreasing order, Y_1 ≤ Y_2 ≤ ··· ≤ Y_n; and the first equation
of (13) may be transformed into

    $$ K_n^+ = \frac{1}{\sqrt n} \max(1 - Y_1,\, 2 - Y_2,\, \ldots,\, n - Y_n). $$

If 0 ≤ t ≤ n, the probability that K_n^+ ≤ t/√n is therefore the probability that
Y_j ≥ j − t for 1 ≤ j ≤ n. This is not hard to express in terms of n-dimensional
integrals,

    $$ \frac{\int_{\alpha_n}^{n} dy_n \int_{\alpha_{n-1}}^{y_n} dy_{n-1} \ldots \int_{\alpha_1}^{y_2} dy_1}{\int_0^{n} dy_n \int_0^{y_n} dy_{n-1} \ldots \int_0^{y_2} dy_1}, \qquad \text{where } \alpha_j = \max(j - t, 0). \tag{24} $$

The denominator here is immediately evaluated: It is found to be n^n/n!, which
makes sense since the hypercube of all vectors (y_1, y_2, ..., y_n) with 0 ≤ y_j ≤ n
has volume n^n, and it can be divided into n! equal parts corresponding to each
possible ordering of the y’s. The integral in the numerator is a little more
difficult, but it yields to the attack suggested in exercise 17, and we get the
general formulas

    $$ \Pr\Bigl(K_n^+ \le \frac{t}{\sqrt n}\Bigr) = \frac{t}{n^n} \sum_{0\le k\le t} \binom{n}{k} (k-t)^k (t+n-k)^{n-k-1} \tag{25} $$

    $$ \phantom{\Pr\Bigl(K_n^+ \le \frac{t}{\sqrt n}\Bigr)} = 1 - \frac{t}{n^n} \sum_{t<k\le n} \binom{n}{k} (k-t)^k (t+n-k)^{n-k-1}. \tag{26} $$

The distribution of K_n^- is exactly the same. Equation (26) was first obtained
by N. V. Smirnov [Uspekhi Mat. Nauk 10 (1944), 176-206]; see also Z. W.
Birnbaum and Fred H. Tingey, Annals Math. Stat. 22 (1951), 592-596. Smirnov
derived the asymptotic formula

    $$ \Pr(K_n^+ \le s) = 1 - e^{-2s^2}\Bigl(1 - \frac{2}{3}\,\frac{s}{\sqrt n} + O\Bigl(\frac{1}{n}\Bigr)\Bigr) \tag{27} $$

for all fixed s ≥ 0; this yields the approximations for large n that appear in
Table 2.

Abel’s binomial theorem, Eq. 1.2.6-(16), shows the equivalence of (25) and
(26). We can extend Table 2 using either formula, but there is an interesting
tradeoff: Although the sum in (25) has only about s√n terms, when s = t/√n is
given, it must be evaluated with multiple-precision arithmetic, because the terms
are large and their leading digits cancel out. No such problem arises in (26), since
its terms are all positive; but (26) has n − s√n terms.
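
As an illustration of formula (26), the sum over t < k ≤ n of C(n,k)(k−t)^k(t+n−k)^{n−k−1}, multiplied by t/n^n and subtracted from 1, can be evaluated exactly with rational arithmetic. This is a minimal sketch rather than a recommended table-building routine; the function name `ks_exact_probability` is hypothetical, and t is assumed to satisfy 0 < t ≤ n.

```python
from fractions import Fraction
from math import comb, floor

def ks_exact_probability(n, t):
    """Pr(K_n^+ <= t / sqrt(n)) by the all-positive sum (26), for 0 < t <= n."""
    t = Fraction(t)
    total = Fraction(0)
    for k in range(floor(t) + 1, n + 1):   # all integers k with t < k <= n
        # the exponent n - k - 1 equals -1 when k = n, which Fraction handles
        total += comb(n, k) * (k - t) ** k * (t + n - k) ** (n - k - 1)
    return 1 - t * total / Fraction(n) ** n
```

Because every term of the sum is positive, ordinary floating point would also be adequate here; exact fractions are used only so that small cases can be checked by hand.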


EXERCISES 


1. [00] What line of the chi-square table should be used to check whether or not the
value V = 7 7/48 of Eq. (5) is improbably high?

2. [20] If two dice are “loaded” so that, on one die, the value 1 will turn up exactly 
twice as often as any of the other values, and the other die is similarly biased towards 6, 
compute the probability p_s that a total of exactly s will appear on the two dice, for
2 ≤ s ≤ 12.


3. [23] Some dice that were loaded as described in the previous exercise were rolled 
144 times, and the following values were observed: 


value of s =            2   3   4   5   6   7   8   9  10  11  12
observed number, Y_s =  2   6  10  16  18  32  20  13  16   9   2


Apply the chi-square test to these values, using the probabilities in (1), pretending that 
the dice are not in fact known to be faulty. Does the chi-square test detect the bad 
dice? If not, explain why not. 

4. [23] The author actually obtained the data in experiment 1 of (9) by simulating 
dice in which one was normal, the other was loaded so that it always turned up 1 or 6. 
(The latter two possibilities were equally probable.) Compute the probabilities that 
replace (1) in this case, and by using a chi-square test decide if the results of that 
experiment are consistent with the dice being loaded in this way. 

5. [22] Let F(x) be the uniform distribution, Fig. 3(b). Find K_20^+ and K_20^- for the
following 20 observations:

0.414, 0.732, 0.236, 0.162, 0.259, 0.442, 0.189, 0.693, 0.098, 0.302,
0.442, 0.434, 0.141, 0.017, 0.318, 0.869, 0.772, 0.678, 0.354, 0.718,


and state whether these observations are significantly different from the expected 
behavior with respect to either of these two tests. 

6. [M20] Consider F_n(x), as given in Eq. (10), for fixed x. What is the probability
that F_n(x) = s/n, given an integer s? What is the mean value of F_n(x)? What is the
standard deviation?

7. [M15] Show that K_n^+ and K_n^- can never be negative. What is the largest possible
value K_n^+ can have?

8. [00] The text describes an experiment in which 20 values of the statistic K_10^+
were obtained in the study of a random sequence. These values were plotted, to obtain
Fig. 4, and a KS statistic was computed from the resulting graph. Why were the table
entries for n = 20 used to study the resulting statistic, instead of the table entries for
n = 10?

9. [20] The experiment described in the text consisted of plotting 20 values of K_10^+,
computed from the maximum-of-5 test applied to different parts of a random sequence.
We could have computed also the corresponding 20 values of K_10^-; since K_10^- has the
same distribution as K_10^+, we could lump together the 40 values thus obtained (that is,
20 of the K_10^+’s and 20 of the K_10^-’s), and a KS test could be applied so that we would
get new values K_40^+, K_40^-. Discuss the merits of this idea.

10. [20] Suppose a chi-square test is done by making n observations, and the value V 
is obtained. Now we repeat the test on these same n observations over again (getting, 
of course, the same results), and we put together the data from both tests, regarding 
it as a single chi-square test with 2n observations. (This procedure violates the text’s 
stipulation that all of the observations must be independent of one another.) How is 
the second value of V related to the first one? 

11. [10] Solve exercise 10 substituting the KS test for the chi-square test. 

12. [M28] Suppose a chi-square test is made on a set of n observations, assuming that 
p_s is the probability that each observation falls into category s; but suppose that in
actual fact the observations have probability q_s ≠ p_s of falling into category s. (See
exercise 3.) We would, of course, like the chi-square test to detect the fact that the p_s
assumption was incorrect. Show that this will happen, if n is large enough. Prove also 
the analogous result for the KS test. 

13. [M24] Prove that Eqs. (13) are equivalent to Eqs. (11). 

14. [HM26] Let Z_s be given by Eq. (18). Show directly by using Stirling’s approximation
that the multinomial probability

    $$ \frac{n!\, p_1^{Y_1} \ldots p_k^{Y_k}}{Y_1! \ldots Y_k!} = \frac{e^{-V/2}}{\sqrt{(2n\pi)^{k-1} p_1 \ldots p_k}} + O(n^{-k/2}), $$

if Z_1, Z_2, ..., Z_k are bounded as n → ∞. (This idea leads to a proof of the chi-square
test that is much closer to “first principles,” and requires less handwaving, than the
derivation in the text.)


15. [HM24] Polar coordinates in two dimensions are conventionally defined by the
equations x = r cos θ and y = r sin θ. For the purposes of integration, we have dx dy =
r dr dθ. More generally, in n-dimensional space we can let

    $$ x_k = r \sin\theta_1 \ldots \sin\theta_{k-1} \cos\theta_k, \quad 1 \le k < n, \qquad \text{and} \qquad x_n = r \sin\theta_1 \ldots \sin\theta_{n-1}. $$

Show that in this case

    $$ dx_1\, dx_2 \ldots dx_n = \bigl| r^{n-1} \sin^{n-2}\theta_1 \ldots \sin\theta_{n-2}\; dr\, d\theta_1 \ldots d\theta_{n-1} \bigr|. $$

16. [HM35] Generalize Theorem 1.2.11.3A to find the value of

    $$ \gamma\bigl(x+1,\; x + z\sqrt{2x} + y\bigr) \big/ \Gamma(x+1), $$

for large x and fixed y, z. Disregard terms of the answer that are O(1/x). Use this
result to find the approximate solution, t, to the equation

    $$ \gamma\Bigl(\frac{\nu}{2}, \frac{t}{2}\Bigr) \Big/ \Gamma\Bigl(\frac{\nu}{2}\Bigr) = p $$

for large ν and fixed p, thereby accounting for the asymptotic formulas indicated in
Table 1. [Hint: See exercise 1.2.11.3-8.]




17. [HM26] Let t be a fixed real number. For 0 ≤ k ≤ n, let

    $$ P_{nk}(x) = \int_{n-t}^{x} dx_n \int_{n-1-t}^{x_n} dx_{n-1} \ldots \int_{k+1-t}^{x_{k+2}} dx_{k+1} \int_{0}^{x_{k+1}} dx_k \ldots \int_{0}^{x_2} dx_1; $$

by convention, let P_{00}(x) = 1. Prove the following relations:

a) $$ P_{nk}(x) = \int_{n}^{x+t} dx_n \int_{n-1}^{x_n} dx_{n-1} \ldots \int_{k+1}^{x_{k+2}} dx_{k+1} \int_{t}^{x_{k+1}} dx_k \ldots \int_{t}^{x_2} dx_1. $$

b) P_{n0}(x) = (x + t)^n/n! − (x + t)^{n−1}/(n − 1)!.

c) $$ P_{nk}(x) - P_{n(k-1)}(x) = \frac{(k-t)^k}{k!}\, P_{(n-k)0}(x - k), \quad \text{if } 1 \le k \le n. $$

d) Obtain a general formula for P_{nk}(x), and apply it to the evaluation of Eq. (24).


18. [M20] Give a “simple” reason why K_n^- has the same probability distribution
as K_n^+.


19. [HM48] Develop tests, analogous to the Kolmogorov-Smirnov test, for use with
multivariate distributions F(x_1, ..., x_r) = Pr(X_1 ≤ x_1, ..., X_r ≤ x_r). (Such procedures
could be used, for example, in place of the “serial test” in the next section.)


20. [HM41] Deduce further terms of the asymptotic behavior of the KS distribution, 
extending (27). 


21. [M40] Although the text states that the KS test should be applied only when
F(x) is a continuous distribution function, it is, of course, possible to try to compute
K_n^+ and K_n^- even when the distribution has jumps. Analyze the probable behavior of
K_n^+ and K_n^- for various discontinuous distributions F(x). Compare the effectiveness
of the resulting statistical test with the chi-square test on several samples of random
numbers.


22. [HM46] Investigate the “improved” KS test suggested in the answer to exercise 6. 


23. [M22] (T. Gonzalez, S. Sahni, and W. R. Franta.) (a) Suppose that the maximum
value in formula (13) for the KS statistic K_n^+ occurs at a given index j where
⌊nF(X_j)⌋ = k. Prove that F(X_j) = max_{1≤i≤n} {F(X_i) | ⌊nF(X_i)⌋ = k}. (b) Design
an algorithm that calculates K_n^+ and K_n^- in O(n) steps (without sorting).

24. [40] Experiment with various probability distributions (p, q,r) on three categories, 
where p +q +r = 1, by computing the exact distribution of the chi-square statistic V 
for various n, thereby determining how accurate an approximation the chi-square 
distribution with two degrees of freedom really is. 


25. [HM26] Suppose Y_i = Σ_{j=1}^{n} a_{ij} X_j + μ_i for 1 ≤ i ≤ m, where X_1, ..., X_n are
independent random variables with mean zero and unit variance, and the matrix A =
(a_{ij}) has rank n.

a) Express the covariance matrix C = (c_{ij}), where c_{ij} = E(Y_i − μ_i)(Y_j − μ_j), in
terms of the matrix A.
b) Prove that if C̄ = (c̄_{ij}) is any matrix such that C C̄ C = C, the statistic

    $$ W = \sum_{i=1}^{m} \sum_{j=1}^{m} (Y_i - \mu_i)(Y_j - \mu_j)\, \bar c_{ij} $$

is equal to X_1^2 + ··· + X_n^2. [Consequently, if the X_j have the normal distribution,
W has the chi-square distribution with n degrees of freedom.]


The equanimity of your average tosser of coins
depends upon a law ... which ensures that
he will not upset himself by losing too much
nor upset his opponent by winning too often.


— TOM STOPPARD, Rosencrantz & Guildenstern are Dead (1966) 


3.3.2. Empirical Tests 


In this section we shall discuss eleven kinds of specific tests that have traditionally 
been applied to sequences in order to investigate their randomness. The discus- 
sion of each test has two parts: (a) a “plug-in” description of how to perform the 
test; and (b) a study of the theoretical basis for the test. (Readers who lack math- 
ematical training may wish to skip over the theoretical discussions. Conversely, 
mathematically inclined readers may find the associated theory quite interesting, 
even if they never intend to test random number generators, since some instruc- 
tive combinatorial questions are involved here. Indeed, this section introduces 
several topics that will be important to us later in quite different contexts.) 
Each test is applied to a sequence 


    $$ \langle U_n \rangle = U_0, U_1, U_2, \ldots \tag{1} $$

of real numbers, which purports to be independently and uniformly distributed
between zero and one. Some of the tests are designed primarily for integer-valued
sequences, instead of the real-valued sequence (1). In this case, the auxiliary
sequence

    $$ \langle Y_n \rangle = Y_0, Y_1, Y_2, \ldots \tag{2} $$

defined by the rule

    $$ Y_n = \lfloor d\,U_n \rfloor \tag{3} $$

is used instead. This is a sequence of integers that purports to be independently
and uniformly distributed between 0 and d − 1. The number d is chosen for
convenience; for example, we might have d = 64 = 2^6 on a binary computer,
so that Y_n represents the six most significant bits of the binary representation
of U_n. The value of d should be large enough so that the test is meaningful, but
not so large that the test becomes impracticably difficult to carry out.

The quantities U_n, Y_n, and d will have the significance stated above throughout
this section, although the value of d will probably be different in different
tests.


A. Equidistribution test (Frequency test). The first requirement that
sequence (1) must meet is that its numbers are, in fact, uniformly distributed
between zero and one. There are two ways to make this test: (a) Use the
Kolmogorov-Smirnov test, with F(x) = x for 0 ≤ x ≤ 1. (b) Let d be a
convenient number, such as 100 on a decimal computer, 64 or 128 on a binary
computer, and use the sequence (2) instead of (1). For each integer r, 0 ≤ r < d,
count the number of times that Y_j = r for 0 ≤ j < n, and then apply the
chi-square test using k = d and probability p_s = 1/d for each category.
The theory behind this test has been covered in Section 3.3.1. 
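
Method (b) might be sketched in Python as follows. The sketch assumes the chi-square statistic of Section 3.3.1 in the form V = Σ (Y_s − n p_s)²/(n p_s), which is consistent with Eq. (18) there; the function names are hypothetical.

```python
import math

def chi_square_statistic(counts, probabilities, n):
    """V = sum over categories of (Y_s - n p_s)^2 / (n p_s)."""
    return sum((y - n * p) ** 2 / (n * p) for y, p in zip(counts, probabilities))

def equidistribution_statistic(us, d):
    """Chi-square statistic V for the frequency test with k = d categories."""
    counts = [0] * d
    for u in us:
        counts[math.floor(d * u)] += 1     # Y_j = floor(d * U_j), assuming 0 <= U_j < 1
    return chi_square_statistic(counts, [1.0 / d] * d, len(us))
```

The resulting V would then be compared with the chi-square table for d − 1 degrees of freedom.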

B. Serial test. More generally, we want pairs of successive numbers to be
uniformly distributed in an independent manner. The sun comes up just about as
often as it goes down, in the long run, but that doesn’t make its motion random.

To carry out the serial test, we simply count the number of times that the
pair (Y_{2j}, Y_{2j+1}) = (q, r) occurs, for 0 ≤ j < n; these counts are to be made for
each pair of integers (q, r) with 0 ≤ q, r < d, and the chi-square test is applied
to these k = d² categories with probability 1/d² in each category. As with the
equidistribution test, d may be any convenient number, but it will be somewhat
smaller than the values suggested above since a valid chi-square test should have
n large compared to k (say n ≥ 5d² at least).

Clearly we can generalize this test to triples, quadruples, etc., instead of 
pairs (see exercise 2); however, the value of d must then be severely reduced in 
order to avoid having too many categories. When quadruples and larger numbers 
of adjacent elements are considered, we therefore make use of less exact tests such 
as the poker test or the maximum test described below. 

Notice that 2n numbers of the sequence (2) are used in this test in order
to make n observations. It would be a mistake to perform the serial test
on the pairs (Y_0, Y_1), (Y_1, Y_2), ..., (Y_{n−1}, Y_n); can the reader see why? We
might perform another serial test on the pairs (Y_{2j+1}, Y_{2j+2}), and expect the
sequence to pass both tests, remembering that the tests aren’t independent of
each other. Alternatively, George Marsaglia has proved that, if the pairs (Y_0, Y_1),
(Y_1, Y_2), ..., (Y_{n−1}, Y_n) are used, and if we use the usual chi-square method to
compute both the statistics V_2 for the serial test and V_1 for the frequency test on
Y_0, ..., Y_{n−1} with the same value of d, then V_2 − V_1 should have the chi-square
distribution with d(d − 1) degrees of freedom when n is large. (See exercise 24.)
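
The basic serial test on non-overlapping pairs can be sketched as below. This illustrates only the counting step and the chi-square statistic, not Marsaglia's overlapping variant; the name `serial_statistic` is hypothetical.

```python
def serial_statistic(ys, d):
    """Chi-square statistic V for the non-overlapping pairs (Y_{2j}, Y_{2j+1})."""
    n = len(ys) // 2                       # number of pairs observed
    counts = [0] * (d * d)
    for j in range(n):
        q, r = ys[2 * j], ys[2 * j + 1]
        counts[q * d + r] += 1             # category index for the pair (q, r)
    expected = n / (d * d)                 # n times 1/d^2 for every category
    return sum((c - expected) ** 2 / expected for c in counts)
```

With k = d² categories, the value returned would be referred to the chi-square distribution with d² − 1 degrees of freedom.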


C. Gap test. Another test is used to examine the length of “gaps” between
occurrences of U_j in a certain range. If α and β are two real numbers with
0 ≤ α < β ≤ 1, we want to consider the lengths of consecutive subsequences U_j,
U_{j+1}, ..., U_{j+r} in which U_{j+r} lies between α and β but the other U’s do not.
(This subsequence of r + 1 numbers represents a gap of length r.)


Algorithm G (Data for gap test). The following algorithm, applied to the
sequence (1) for any given values of α and β, counts the number of gaps of
lengths 0, 1, ..., t − 1 and the number of gaps of length ≥ t, until n gaps have
been tabulated.

G1. [Initialize.] Set j ← −1, s ← 0, and set COUNT[r] ← 0 for 0 ≤ r ≤ t.
G2. [Set r zero.] Set r ← 0.
G3. [α ≤ U_j < β?] Increase j by 1. If U_j ≥ α and U_j < β, go to step G5.
G4. [Increase r.] Increase r by one, and return to step G3.
G5. [Record the gap length.] (A gap of length r has now been found.) If r ≥ t,
    increase COUNT[t] by one, otherwise increase COUNT[r] by one.
G6. [n gaps found?] Increase s by one. If s < n, return to step G2. ∎
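
Algorithm G translates almost line for line into Python. In this sketch the sequence is consumed through an iterator instead of the explicit index j, and the name `gap_counts` is hypothetical; if the input runs out before n gaps are found, `next` raises `StopIteration`, which corresponds to the case where the algorithm does not terminate.

```python
def gap_counts(us, alpha, beta, t, n):
    """Algorithm G: tabulate n gaps; count[t] holds the gaps of length >= t."""
    count = [0] * (t + 1)                  # G1: COUNT[r] = 0 for 0 <= r <= t
    it = iter(us)
    for _ in range(n):                     # G6: repeat until n gaps are recorded
        r = 0                              # G2
        while True:
            u = next(it)                   # G3: advance to the next U_j
            if alpha <= u < beta:
                break
            r += 1                         # G4
        count[min(r, t)] += 1              # G5: lengths >= t share one category
    return count
```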

[Flowchart: G1. Initialize → G2. Set r zero → G3. α ≤ U_j < β? From G3, No → G4. Increase r → G3, and Yes → G5. Record the gap length → G6. n gaps found? From G6, No → G2, and Yes → stop.]


Fig. 6. Gathering data for the gap test. (Algorithms for the “coupon-collector’s test” 
and the “run test” are similar.) 


After Algorithm G has been performed, the chi-square test is applied to
the k = t + 1 values of COUNT[0], COUNT[1], ..., COUNT[t], using the following
probabilities:

    $$ p_r = p(1-p)^r, \quad \text{for } 0 \le r \le t-1; \qquad p_t = (1-p)^t. \tag{4} $$

Here p = β − α is the probability that α ≤ U_j < β. The values of n and t are to
be chosen, as usual, so that each of the values of COUNT[r] is expected to be 5 or
more, preferably more.
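
The probabilities of Eq. (4), namely p(1 − p)^r for r < t and (1 − p)^t for the final category with p = β − α, are simple to tabulate. A small sketch follows, with the hypothetical name `gap_probabilities`.

```python
def gap_probabilities(alpha, beta, t):
    """Probabilities p_0, ..., p_t of Eq. (4), with p = beta - alpha."""
    p = beta - alpha
    probs = [p * (1 - p) ** r for r in range(t)]   # gaps of length r < t
    probs.append((1 - p) ** t)                     # gaps of length >= t
    return probs
```

Multiplying each entry by n gives the expected counts, which makes it easy to check that t has not been chosen so large that some category is expected to receive fewer than 5 gaps.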

The gap test is often applied with α = 0 or β = 1 in order to omit one of
the comparisons in step G3. The special cases (α, β) = (0, 1/2) or (1/2, 1) give rise
to tests that are sometimes called “runs above the mean” and “runs below the
mean,” respectively.

The probabilities in Eq. (4) are easily deduced, so this derivation is left to
the reader. Notice that the gap test as described above observes the lengths of n
gaps; it does not observe the gap lengths among n numbers. If the sequence ⟨U_n⟩
is sufficiently nonrandom, Algorithm G might not terminate. Other gap tests
that examine a fixed number of U’s have also been proposed (see exercise 5).


D. Poker test (Partition test). The “classical” poker test considers n groups
of five successive integers, {Y_{5j}, Y_{5j+1}, ..., Y_{5j+4}} for 0 ≤ j < n, and observes
which of the following seven patterns is matched by each (orderless) quintuple:
All different: abcde 
One pair: aabcd 
Two pairs: aabbc 
Three of a kind: aaabc 
Full house: aaabb 
Four of a kind: aaaab 
Five of a kind: aaaaa 
A chi-square test is based on the number of quintuples in each category. 


It is reasonable to ask for a somewhat simpler version of this test, to facilitate 
the programming involved. A good compromise would simply be to count the
number of distinct values in the set of five. We would then have five categories:


5 values = all different; 
4 values = one pair; 
3 values = two pairs, or three of a kind; 
2 values = full house, or four of a kind; 
1 value = five of a kind. 
This breakdown is easier to determine systematically, and the test is nearly
as good.

In general we can consider n groups of k successive numbers, and we can
count the number of k-tuples with r different values. A chi-square test is then
made, using the probability

    $$ p_r = \frac{d(d-1)\ldots(d-r+1)}{d^k} \left\{ {k \atop r} \right\} \tag{5} $$

that there are r different. (The Stirling numbers $\left\{ {k \atop r} \right\}$ are defined in Section 1.2.6,
and they can readily be computed using the formulas given there.) Since the
probability p_r is very small when r = 1 or 2, we generally lump a few categories
of low probability together before the chi-square test is applied.

To derive the proper formula for p_r, we must count how many of the d^k
k-tuples of numbers between 0 and d − 1 have exactly r different elements, and
divide the total by d^k. Since d(d − 1)...(d − r + 1) is the number of ordered
choices of r things from a set of d objects, we need only show that $\left\{ {k \atop r} \right\}$ is the
number of ways to partition a set of k elements into exactly r parts. Therefore
exercise 1.2.6-64 completes the derivation of Eq. (5).
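
The probabilities of Eq. (5), d(d − 1)...(d − r + 1)/d^k times the Stirling number of the second kind for k and r, can be tabulated exactly. The sketch below assumes the standard recurrence for Stirling numbers of the second kind (each new element either joins one of the j existing parts or starts a new part); the function names are hypothetical.

```python
from fractions import Fraction

def stirling2(k, r):
    """Stirling number of the second kind: partitions of k elements into r parts."""
    table = [[0] * (r + 1) for _ in range(k + 1)]
    table[0][0] = 1
    for i in range(1, k + 1):
        for j in range(1, min(i, r) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[k][r]

def poker_probabilities(d, k):
    """Exact p_r of Eq. (5) for r = 1, ..., k, as a dict of Fractions."""
    probs = {}
    falling = 1                            # running product d (d-1) ... (d-r+1)
    for r in range(1, k + 1):
        falling *= d - r + 1
        probs[r] = Fraction(falling * stirling2(k, r), d ** k)
    return probs
```

For d = 10 and k = 5, for instance, the five probabilities sum to 1, and the entry for r = 1 is 1/10000, which shows why categories of low probability are lumped together before the chi-square test is applied.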


E. Coupon collector’s test. The next test is related to the poker test somewhat
as the gap test is related to the frequency test. We use the sequence Y_0,
Y_1, ..., and we observe the lengths of segments Y_{j+1}, Y_{j+2}, ..., Y_{j+r} that are
required to get a “complete set” of integers from 0 to d − 1. Algorithm C describes
this precisely:

Algorithm C (Data for coupon collector’s test). Given a sequence of integers
Y_0, Y_1, ..., with 0 ≤ Y_j < d, this algorithm counts the lengths of n consecutive
“coupon collector” segments. At the conclusion of the algorithm, COUNT[r] is the
number of segments with length r, for d ≤ r < t, and COUNT[t] is the number of
segments with length ≥ t.

C1. [Initialize.] Set j ← −1, s ← 0, and set COUNT[r] ← 0 for d ≤ r ≤ t.
C2. [Set q, r zero.] Set q ← r ← 0, and set OCCURS[k] ← 0 for 0 ≤ k < d.
C3. [Next observation.] Increase r and j by 1. If OCCURS[Y_j] ≠ 0, repeat this
    step.
C4. [Complete set?] Set OCCURS[Y_j] ← 1 and q ← q + 1. (The subsequence
    observed so far contains q distinct values; if q = d, we therefore have a
    complete set.) If q < d, return to step C3.
C5. [Record the length.] If r ≥ t, increase COUNT[t] by one, otherwise increase
    COUNT[r] by one.
C6. [n found?] Increase s by one. If s < n, return to step C2. ∎
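
A Python rendering of Algorithm C might look as follows. As with the gap test, the index j is replaced by an iterator, and steps C3 and C4 are merged into a single loop; the name `coupon_counts` is hypothetical.

```python
def coupon_counts(ys, d, t, n):
    """Algorithm C: count[r] for d <= r < t, and count[t] for lengths >= t."""
    count = {r: 0 for r in range(d, t + 1)}   # C1
    it = iter(ys)
    for _ in range(n):                        # C6: n segments in all
        occurs = [False] * d                  # C2
        q = r = 0
        while q < d:                          # C4: stop once the set is complete
            y = next(it)                      # C3: next observation
            r += 1
            if not occurs[y]:
                occurs[y] = True
                q += 1
        count[min(r, t)] += 1                 # C5
    return count
```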


For an example of this algorithm, see exercise 7. We may think of a boy col- 
lecting d types of coupons, which are randomly distributed in his breakfast cereal 
boxes; he must keep eating more cereal until he has one coupon of each type. 

A chi-square test is to be applied to COUNT[d], COUNT[d + 1], ..., COUNT[t],
with k = t − d + 1, after Algorithm C has counted n lengths. The corresponding
probabilities are

    $$ p_r = \frac{d!}{d^r} \left\{ {r-1 \atop d-1} \right\}, \quad d \le r < t; \qquad p_t = 1 - \frac{d!}{d^{t-1}} \left\{ {t-1 \atop d} \right\}. \tag{6} $$

To derive these probabilities, we simply note that if q_r denotes the probability
that a subsequence of length r is incomplete, then

    $$ q_r = 1 - \frac{d!}{d^r} \left\{ {r \atop d} \right\} $$

by Eq. (5); for this means we have an r-tuple of elements that do not have all d
different values. Then (6) follows from the relations p_t = q_{t−1} and

    $$ p_r = q_{r-1} - q_r \qquad \text{for } d \le r < t. $$
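
The coupon collector probabilities can be tabulated in the same exact style. The sketch uses the closed forms p_r = (d!/d^r) S(r−1, d−1) for d ≤ r < t and p_t = 1 − (d!/d^{t−1}) S(t−1, d), where S denotes a Stirling number of the second kind computed by its standard recurrence; the function names are hypothetical.

```python
from fractions import Fraction
from math import factorial

def stirling2(n, m):
    """Stirling number of the second kind: partitions of n elements into m parts."""
    table = [[0] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, m) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][m]

def coupon_probabilities(d, t):
    """Exact p_d, ..., p_t for the coupon collector's test, as a dict of Fractions."""
    probs = {r: Fraction(factorial(d) * stirling2(r - 1, d - 1), d ** r)
             for r in range(d, t)}
    # p_t is the probability that the first t - 1 values are still incomplete
    probs[t] = 1 - Fraction(factorial(d) * stirling2(t - 1, d), d ** (t - 1))
    return probs
```

When d = 2 the segment length is r with probability (1/2)^{r−1}, which gives a quick sanity check on the table.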


For formulas that arise in connection with generalizations of the coupon 
collector’s test, see exercises 9 and 10 and also the papers by George Pólya, 
Zeitschrift für angewandte Math. und Mech. 10 (1930), 96-97; Hermann von
Schelling, AMM 61 (1954), 306-311. 


F. Permutation test. Divide the input sequence into n groups of t elements
each, that is, (U_{jt}, U_{jt+1}, ..., U_{jt+t−1}) for 0 ≤ j < n. The elements in each group
can have t! possible relative orderings; the number of times each ordering appears
is counted, and a chi-square test is applied with k = t! and with probability 1/t!
for each ordering.

For example, if t = 3 we would have six possible categories, according to
whether U_{3j} < U_{3j+1} < U_{3j+2} or U_{3j} < U_{3j+2} < U_{3j+1} or ··· or U_{3j+2} <
U_{3j+1} < U_{3j}. We assume in this test that equality between U’s does not occur;
such an assumption is justified, for the probability that two U’s are equal is zero.

A convenient way to perform the permutation test on a computer makes use 
of the following algorithm, which is of interest in itself: 


Algorithm P (Analyze a permutation). Given a sequence of distinct elements
(U_1, ..., U_t), we compute an integer f(U_1, ..., U_t) such that

    $$ 0 \le f(U_1, \ldots, U_t) < t!, $$

and f(U_1, ..., U_t) = f(V_1, ..., V_t) if and only if (U_1, ..., U_t) and (V_1, ..., V_t)
have the same relative ordering.

P1. [Initialize.] Set r ← t, f ← 0. (During this algorithm we will have 0 ≤ f <
    t!/r!.)
P2. [Find maximum.] Find the maximum of {U_1, ..., U_r}, and suppose that U_s
    is the maximum. Set f ← r·f + s − 1.
P3. [Exchange.] Exchange U_r ↔ U_s.
P4. [Decrease r.] Decrease r by one. If r > 1, return to step P2. ∎


The sequence $(U_1, \ldots, U_t)$ will have been sorted into ascending order when 
this algorithm stops. To prove that the result f uniquely characterizes the initial 
order of $(U_1, \ldots, U_t)$, we note that Algorithm P can be run backwards: 


For r = 2, 3, ..., t, 
set $s \leftarrow f \bmod r$, $f \leftarrow \lfloor f/r \rfloor$, 
and exchange $U_r \leftrightarrow U_{s+1}$. 


It is easy to see that this will undo the effects of steps P2-P4; hence no two 
permutations can yield the same value of f, and Algorithm P performs as 
advertised. 
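As an illustration, steps P1 through P4 and the backward procedure can be sketched in Python as follows. This is only a sketch under our own naming (`analyze_permutation`, `reconstruct_ordering`); the inverse starts from the sorted ranks 0, 1, ..., t − 1 and so returns the relative ordering rather than the original values.

```python
def analyze_permutation(u):
    """Algorithm P: map distinct elements (U_1..U_t) to an integer 0 <= f < t!."""
    u = list(u)                                      # work on a copy
    f = 0
    for r in range(len(u), 1, -1):                   # r = t, t-1, ..., 2
        s = max(range(r), key=u.__getitem__) + 1     # 1-based index of the maximum
        f = r * f + s - 1                            # step P2
        u[r - 1], u[s - 1] = u[s - 1], u[r - 1]      # step P3
    return f

def reconstruct_ordering(f, t):
    """Run Algorithm P backwards, starting from the sorted ranks 0..t-1."""
    u = list(range(t))
    for r in range(2, t + 1):
        s = f % r
        f //= r
        u[r - 1], u[s] = u[s], u[r - 1]              # exchange U_r <-> U_{s+1}
    return u
```

Two sequences with the same relative ordering, such as (0.3, 0.1, 0.2) and (9, 2, 5), receive the same value of f, so f can serve directly as the category index in the chi-square test.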

The essential idea that underlies Algorithm P is a mixed-radix representation 
called the “factorial number system”: Every integer in the range $0 \le f < t!$ can 
be uniquely written in the form 

$$\begin{aligned} f &= (\ldots(c_{t-1} \times (t-1) + c_{t-2}) \times (t-2) + \cdots) \times 2 + c_1 \\ &= (t-1)!\,c_{t-1} + (t-2)!\,c_{t-2} + \cdots + 2!\,c_2 + 1!\,c_1 \end{aligned} \tag{7}$$

where the “digits” $c_j$ are integers satisfying 

$$0 \le c_j \le j, \quad \text{for } 1 \le j < t. \tag{8}$$

In Algorithm P, $c_{r-1} = s - 1$ when step P2 is performed for a given value of r. 
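The digits of (7) can be peeled off from the low-order end by dividing successively by 2, 3, ..., t. A small sketch, with the hypothetical helper name `factorial_digits`:

```python
def factorial_digits(f, t):
    """Return [c_1, ..., c_{t-1}] with f = sum(j! * c_j) and 0 <= c_j <= j."""
    digits = []
    for j in range(1, t):
        f, c = divmod(f, j + 1)    # c_j is the remainder modulo j + 1
        digits.append(c)
    return digits
```

For example, 463 = 3·5! + 4·4! + 1·3! + 0·2! + 1·1!, so with t = 6 the digits $(c_1, \ldots, c_5)$ are (1, 0, 1, 4, 3).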


G. Run test. A sequence may also be tested for “runs up” and “runs down.” 
This means that we examine the length of monotone portions of the original 
sequence (segments that are increasing or decreasing). 

As an example of the precise definition of a run, consider the sequence of ten 
digits “1298536704”. Putting a vertical line at the left and right and between 
$X_j$ and $X_{j+1}$ whenever $X_j > X_{j+1}$, we obtain 


| 1 2 9 | 8 | 5 | 3 6 7 | 0 4 |,    (9) 


which displays the “runs up”: There is a run of length 3, followed by two runs 
of length 1, followed by another run of length 3, followed by a run of length 2. 
The algorithm of exercise 12 shows how to tabulate the length of “runs up.” 
Unlike the gap test and the coupon collector’s test (which are in many other 
respects similar to this test), we should not apply a chi-square test to the run 
counts, since adjacent runs are not independent. A long run will tend to be 
followed by a short run, and conversely. This lack of independence is enough to 
invalidate a straightforward chi-square test. Instead, the following statistic may 
be computed, when the run lengths have been determined as in exercise 12: 

$$V = \frac{1}{n-6} \sum_{1 \le i,j \le 6} (\mathrm{COUNT}[i] - n b_i)(\mathrm{COUNT}[j] - n b_j)\, a_{ij}, \tag{10}$$

where n is the length of the sequence, and the matrices of coefficients 
$A = (a_{ij})_{1 \le i,j \le 6}$ and $B = (b_i)_{1 \le i \le 6}$ are given by 

$$A = \begin{pmatrix}
4529.4 & 9044.9 & 13568 & 18091 & 22615 & 27892 \\
9044.9 & 18097 & 27139 & 36187 & 45234 & 55789 \\
13568 & 27139 & 40721 & 54281 & 67852 & 83685 \\
18091 & 36187 & 54281 & 72414 & 90470 & 111580 \\
22615 & 45234 & 67852 & 90470 & 113262 & 139476 \\
27892 & 55789 & 83685 & 111580 & 139476 & 172860
\end{pmatrix}, \qquad
B = \begin{pmatrix} 1/6 \\ 5/24 \\ 11/120 \\ 19/720 \\ 29/5040 \\ 1/840 \end{pmatrix}. \tag{11}$$

(The values of $a_{ij}$ shown here are approximate only; exact values can be obtained 
from formulas derived below.) The statistic V in (10) should have the chi-square 
distribution with six, not five, degrees of freedom, when n is large. The value 
of n should be, say, 4000 or more. The same test can be applied to “runs down.” 
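To make the computation of (10) concrete, here is a rough Python sketch. It assumes the approximate matrix A of (11) and takes B to be (1/6, 5/24, 11/120, 19/720, 29/5040, 1/840), the leading coefficients of the mean run counts; the function names are ours, and the run counter is just one straightforward way to tabulate runs up.

```python
A = [
    [4529.4, 9044.9, 13568, 18091, 22615, 27892],
    [9044.9, 18097, 27139, 36187, 45234, 55789],
    [13568, 27139, 40721, 54281, 67852, 83685],
    [18091, 36187, 54281, 72414, 90470, 111580],
    [22615, 45234, 67852, 90470, 113262, 139476],
    [27892, 55789, 83685, 111580, 139476, 172860],
]
B = [1 / 6, 5 / 24, 11 / 120, 19 / 720, 29 / 5040, 1 / 840]

def count_runs_up(x):
    """COUNT[r] = number of ascending runs of length r (1..5); COUNT[6] = length >= 6."""
    count = [0] * 7                      # index 0 is unused
    if not x:
        return count
    length = 1
    for prev, cur in zip(x, x[1:]):
        if prev < cur:
            length += 1
        else:                            # a run ends just before cur
            count[min(length, 6)] += 1
            length = 1
    count[min(length, 6)] += 1           # the final run
    return count

def run_statistic(x):
    """The statistic V of Eq. (10) for runs up; intended for len(x) of about 4000 or more."""
    n = len(x)
    count = count_runs_up(x)
    dev = [count[i + 1] - n * B[i] for i in range(6)]
    return sum(dev[i] * dev[j] * A[i][j] for i in range(6) for j in range(6)) / (n - 6)
```

The resulting V would then be compared against the chi-square distribution with six degrees of freedom.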


A vastly simpler and more practical run test appears in exercise 14, so 
a reader who is interested only in testing random number generators should 
skip the next few pages and go on to the “maximum-of-t test” after looking at 
exercise 14. On the other hand it is instructive from a mathematical standpoint 
to see how a complicated run test with interdependent runs can be treated, so 
we shall now digress for a moment. 

Given any permutation of n elements, let $Z_{pi} = 1$ if position i is the 
beginning of an ascending run of length p or more, and let $Z_{pi} = 0$ otherwise. 
For example, consider the permutation (9) with n = 10; we have 

$$Z_{11} = Z_{21} = Z_{31} = Z_{14} = Z_{15} = Z_{16} = Z_{26} = Z_{36} = Z_{19} = Z_{29} = 1,$$

and all other Z’s are zero. With this notation, 

$$R'_p = Z_{p1} + Z_{p2} + \cdots + Z_{pn} \tag{12}$$

is the number of runs of length ≥ p, and 

$$R_p = R'_p - R'_{p+1} \tag{13}$$
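The indicator variables are easy to tabulate mechanically. The sketch below (our own helper, using 1-based positions as in the text) lists every pair (p, i) for which the indicator equals 1; counting the pairs with a given p yields the number of runs of length p or more.

```python
def run_indicators(x):
    """Return the set of (p, i) such that position i starts a run up of length >= p."""
    n = len(x)
    ones = set()
    i = 0
    while i < n:
        j = i
        while j + 1 < n and x[j] < x[j + 1]:   # extend the ascending run
            j += 1
        length = j - i + 1
        for p in range(1, length + 1):
            ones.add((p, i + 1))               # 1-based starting position
        i = j + 1
    return ones
```

For the permutation 1298536704 this produces exactly the ten pairs listed above, giving 5 runs of length at least 1, 3 of length at least 2, and 2 of length at least 3.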


is the number of runs of length p exactly. Our goal is to compute the mean value 
of $R_p$, and also the covariance 

$$\operatorname{covar}(R_p, R_q) = \operatorname{mean}\bigl((R_p - \operatorname{mean}(R_p))(R_q - \operatorname{mean}(R_q))\bigr),$$

which measures the interdependence of $R_p$ and $R_q$. These mean values are to be 
computed as the average over the set of all n! permutations. 
Equations (12) and (13) show that the answers can be expressed in terms 
of the mean values of $Z_{pi}$ and of $Z_{pi}Z_{qj}$, so as the first step of the derivation we 
obtain the following results (assuming that i < j): 

$$\frac{1}{n!}\sum Z_{pi} = \begin{cases} \dfrac{p + \delta_{i1}}{(p+1)!}, & \text{if } i \le n - p + 1; \\ 0, & \text{otherwise;} \end{cases}$$

$$\frac{1}{n!}\sum Z_{pi}Z_{qj} = \begin{cases} \dfrac{(p + \delta_{i1})\,q}{(p+1)!\,(q+1)!}, & \text{if } i + p < j \le n - q + 1; \\ \dfrac{p + \delta_{i1}}{(p+1)!\,q!} - \dfrac{p + q + \delta_{i1}}{(p+q+1)!}, & \text{if } i + p = j \le n - q + 1; \\ 0, & \text{otherwise.} \end{cases} \tag{14}$$

The Σ-signs stand for summation over all possible permutations. To illustrate 
the calculations involved here, we will work the most difficult case, when 
$i + p = j \le n - q + 1$, and when i > 1. The quantity $Z_{pi}Z_{qj}$ is either zero or one, 
so the summation consists of counting all permutations $U_1U_2 \ldots U_n$ for which 
$Z_{pi} = Z_{qj} = 1$, that is, all permutations such that 

$$U_{i-1} > U_i < \cdots < U_{i+p-1} > U_{i+p} < \cdots < U_{i+p+q-1}. \tag{15}$$

The number of such permutations may be enumerated as follows: There are 
$\binom{n}{p+q+1}$ ways to choose the elements for the positions indicated in (15); there 
are 

$$(p+q+1)\binom{p+q}{p} - \binom{p+q+1}{p+1} - \binom{p+q+1}{1} + 1 \tag{16}$$

ways to arrange them in the order (15), as shown in exercise 13; and there 
are (n − p − q − 1)! ways to arrange the remaining elements. Thus there are 
$\binom{n}{p+q+1}(n-p-q-1)!$ times (16) ways in all, and we divide by n! to get the 
desired formula. 

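From relations (14) a rather lengthy calculation leads to 

$$\operatorname{mean}(R'_p) = (n+1)p/(p+1)! - (p-1)/p!, \qquad 1 \le p \le n; \tag{17}$$

$$\begin{aligned}
\operatorname{covar}(R'_p, R'_q) &= \operatorname{mean}(R'_p R'_q) - \operatorname{mean}(R'_p)\operatorname{mean}(R'_q) \\
&= \frac{1}{n!}\sum_{1 \le i,j \le n} \sum Z_{pi}Z_{qj} - \operatorname{mean}(R'_p)\operatorname{mean}(R'_q) \\
&= \begin{cases} \operatorname{mean}(R'_t) + f(p,q,n), & \text{if } p + q \le n, \\ \operatorname{mean}(R'_t) - \operatorname{mean}(R'_p)\operatorname{mean}(R'_q), & \text{if } p + q > n, \end{cases}
\end{aligned} \tag{18}$$

where t = max(p, q), s = p + q, and 

$$f(p,q,n) = (n+1)\left(\frac{s(1-pq)+pq}{(p+1)!\,(q+1)!} - \frac{2s}{(s+1)!}\right) + 2\left(\frac{s-1}{s!}\right) + \frac{(s^2-s-2)pq - s^2 - p^2q^2 + 1}{(p+1)!\,(q+1)!}. \tag{19}$$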
This expression for the covariance is unfortunately quite complicated, but it is 
necessary for a successful run test as described above. From these formulas it is 
easy to compute 

$$\begin{aligned}
\operatorname{mean}(R_p) &= \operatorname{mean}(R'_p) - \operatorname{mean}(R'_{p+1}), \\
\operatorname{covar}(R_p, R'_q) &= \operatorname{covar}(R'_p, R'_q) - \operatorname{covar}(R'_{p+1}, R'_q), \\
\operatorname{covar}(R_p, R_q) &= \operatorname{covar}(R_p, R'_q) - \operatorname{covar}(R_p, R'_{q+1}).
\end{aligned} \tag{20}$$
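These formulas lend themselves to exact evaluation with rational arithmetic. The sketch below is one possible transcription of (17) through (20), with our own function names; it assumes n is comfortably large (say n ≥ p + q + 2) so that every index stays within the range where (17) applies.

```python
from fractions import Fraction
from math import factorial

def mean_rp(p, n):
    """mean(R'_p), Eq. (17): expected number of runs of length >= p."""
    return Fraction((n + 1) * p, factorial(p + 1)) - Fraction(p - 1, factorial(p))

def f_pqn(p, q, n):
    """The correction term f(p, q, n) of Eq. (19)."""
    s = p + q
    d = factorial(p + 1) * factorial(q + 1)
    return ((n + 1) * (Fraction(s * (1 - p * q) + p * q, d) - Fraction(2 * s, factorial(s + 1)))
            + 2 * Fraction(s - 1, factorial(s))
            + Fraction((s * s - s - 2) * p * q - s * s - p * p * q * q + 1, d))

def covar_prime(p, q, n):
    """covar(R'_p, R'_q), Eq. (18)."""
    t = max(p, q)
    if p + q <= n:
        return mean_rp(t, n) + f_pqn(p, q, n)
    return mean_rp(t, n) - mean_rp(p, n) * mean_rp(q, n)

def covar_exact(p, q, n):
    """covar(R_p, R_q) for runs of exactly p and exactly q, by applying (20) twice."""
    return (covar_prime(p, q, n) - covar_prime(p + 1, q, n)
            - covar_prime(p, q + 1, n) + covar_prime(p + 1, q + 1, n))
```

For instance, `covar_exact(1, 1, n)` works out to 23n/180 + 83/180, and `covar_exact(1, 2, n)` to −7n/360 − 29/180.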
In Annals Math. Stat. 15 (1944), 163-165, J. Wolfowitz proved that the quantities 
$R_1, R_2, \ldots, R_{t-1}, R'_t$ become normally distributed as $n \to \infty$, subject to 
the mean and covariance expressed above; this implies that the following test for 
runs is valid: Given a sequence of n random numbers, compute the number of 
runs $R_p$ of length p for $1 \le p < t$, and also the number of runs $R'_t$ of length t or 
more. Let 

$$Q_1 = R_1 - \operatorname{mean}(R_1), \quad \ldots, \quad Q_{t-1} = R_{t-1} - \operatorname{mean}(R_{t-1}), \quad Q_t = R'_t - \operatorname{mean}(R'_t). \tag{21}$$

Form the matrix C of the covariances of the R’s; for example, $C_{13} = 
\operatorname{covar}(R_1, R_3)$, while $C_{1t} = \operatorname{covar}(R_1, R'_t)$. When t = 6, we have 

$$C = nC_1 + C_2, \tag{22}$$
where 

$$C_1 = \begin{pmatrix}
23/180 & -7/360 & -5/336 & -433/60480 & -13/5670 & -121/181440 \\
-7/360 & 2843/20160 & -989/20160 & -7159/362880 & -10019/1814400 & -1303/907200 \\
-5/336 & -989/20160 & 54563/907200 & -21311/1814400 & -62369/19958400 & -7783/9979200 \\
-433/60480 & -7159/362880 & -21311/1814400 & 886657/39916800 & -257699/239500800 & -62611/239500800 \\
-13/5670 & -10019/1814400 & -62369/19958400 & -257699/239500800 & 29874811/5448643200 & -1407179/21794572800 \\
-121/181440 & -1303/907200 & -7783/9979200 & -62611/239500800 & -1407179/21794572800 & 2134697/1816214400
\end{pmatrix},$$

$$C_2 = \begin{pmatrix}
83/180 & -29/180 & -11/210 & -41/12096 & 91/25920 & 41/18144 \\
-29/180 & -305/4032 & 319/20160 & 2557/72576 & 10177/604800 & 413/64800 \\
-11/210 & 319/20160 & -58747/907200 & 19703/604800 & 239471/19958400 & 39517/9979200 \\
-41/12096 & 2557/72576 & 19703/604800 & -220837/4435200 & 1196401/239500800 & 360989/239500800 \\
91/25920 & 10177/604800 & 239471/19958400 & 1196401/239500800 & -139126639/7264857600 & 4577641/10897286400 \\
41/18144 & 413/64800 & 39517/9979200 & 360989/239500800 & 4577641/10897286400 & 122953057/21794572800
\end{pmatrix}$$

if $n \ge 12$. Now form $A = (a_{ij})$, the inverse of the matrix C, and compute 
$\sum_{i,j=1}^{t} Q_i Q_j a_{ij}$. The result for large n should have approximately the chi-square 
distribution with t degrees of freedom. 

The matrix A given earlier in (11) is the inverse of $C_1$ to five significant figures. 
The true inverse, A, is $n^{-1}C_1^{-1} - n^{-2}C_1^{-1}C_2C_1^{-1} + n^{-3}C_1^{-1}C_2C_1^{-1}C_2C_1^{-1} - \cdots$, 
and it turns out that $C_1^{-1}C_2C_1^{-1}$ is very nearly equal to $-6C_1^{-1}$. Therefore 
by (10), $V \approx Q^T C_1^{-1} Q/(n - 6)$, where $Q = (Q_1 \ldots Q_t)^T$. 

H. Maximum-of-t test. For $0 \le j < n$, let $V_j = \max(U_{tj}, U_{tj+1}, \ldots, U_{tj+t-1})$. 
Now apply the Kolmogorov–Smirnov test to the sequence $V_0, V_1, \ldots, V_{n-1}$, 
with the distribution function $F(x) = x^t$, $0 \le x \le 1$. Alternatively, apply the 
equidistribution test to the sequence $V_0^t, V_1^t, \ldots, V_{n-1}^t$. 

To verify this test, we must show that the distribution function for the $V_j$ is 
$F(x) = x^t$. The probability that $\max(U_1, U_2, \ldots, U_t) \le x$ is the probability that 
$U_1 \le x$ and $U_2 \le x$ and ... and $U_t \le x$, which is the product of the individual 
probabilities, namely $x x \ldots x = x^t$. 
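A small Python sketch of the procedure follows. It assumes the Kolmogorov–Smirnov statistics are taken in their usual one-sided form, scaled by the square root of the number of observations; the function names are illustrative only.

```python
def max_of_t(u, t):
    """V_j = max(U_{tj}, ..., U_{tj+t-1}) over consecutive, non-overlapping groups of t."""
    return [max(u[i:i + t]) for i in range(0, len(u) - t + 1, t)]

def ks_statistics(v, cdf):
    """One-sided Kolmogorov-Smirnov statistics (K+, K-) of the sample v against cdf."""
    xs = sorted(v)
    n = len(xs)
    k_plus = max((j + 1) / n - cdf(x) for j, x in enumerate(xs))
    k_minus = max(cdf(x) - j / n for j, x in enumerate(xs))
    return n ** 0.5 * k_plus, n ** 0.5 * k_minus

def max_of_t_ks(u, t):
    """KS statistics of the group maxima against F(x) = x**t."""
    return ks_statistics(max_of_t(u, t), lambda x: x ** t)
```

Equivalently, one could raise each maximum to the power t and test the results for uniformity.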


I. Collision test. Chi-square tests can be made only when a nontrivial number 
of items are expected in each category. But another kind of test can be used 
when the number of categories is much larger than the number of observations; 
this test is related to “hashing,” an important method for information retrieval 
that we shall study in Section 6.4. 

Suppose we have m urns and we throw n balls at random into those urns, 
where m is much greater than n. Most of the balls will land in urns that were 
previously empty, but if a ball falls into an urn that already contains at least one 
ball we say that a “collision” has occurred. The collision test counts the number 
of collisions, and a generator passes this test if it doesn’t induce too many or too 
few collisions. 

To fix the ideas, suppose $m = 2^{20}$ and $n = 2^{14}$. Then each urn will receive 
only one 64th of a ball, on the average. The probability that a given urn will 
contain exactly k balls is $p_k = \binom{n}{k} m^{-k}(1 - m^{-1})^{n-k}$, so the expected number of 
collisions per urn is 

$$\sum_{k \ge 1} (k-1)p_k = \sum_{k \ge 0} k p_k - \sum_{k \ge 1} p_k = \frac{n}{m} - 1 + p_0.$$

Since $p_0 = (1 - m^{-1})^n = 1 - nm^{-1} + \binom{n}{2}m^{-2} -$ smaller terms, we find that 
the average total number of collisions taken over all m urns is slightly less than 
$n^2/(2m) = 128$. (The actual value is ≈ 127.33.) 
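The expected total can be evaluated directly from the per-urn expression. A minimal sketch, using `expm1` and `log1p` to avoid cancellation when m is large (the helper name is ours):

```python
from math import expm1, log1p

def expected_collisions(m, n):
    """m * (n/m - 1 + (1 - 1/m)**n), rearranged as n + m * ((1 - 1/m)**n - 1)."""
    return n + m * expm1(n * log1p(-1.0 / m))
```

With m = 2**20 and n = 2**14 this evaluates to about 127.33, in agreement with the value quoted above.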

We can use the collision test to rate a random number generator in a large 
number of dimensions. For example, when $m = 2^{20}$ and $n = 2^{14}$ we can test the 
20-dimensional randomness of a number generator by letting d = 2 and forming 
20-dimensional vectors $V_j = (Y_{20j}, Y_{20j+1}, \ldots, Y_{20j+19})$ for $0 \le j < n$. We keep 
a table of $m = 2^{20}$ bits to determine collisions, one bit for each possible value of 
the vector $V_j$; on a computer with 32 bits per word, this amounts to $2^{15}$ words. 
Initially all $2^{20}$ bits of this table are cleared to zero; then for each $V_j$, if the 
corresponding bit is already 1 we record a collision, otherwise we set the bit to 1. 
This test can also be used in 10 dimensions with d = 4, and so on. 
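The bit-table bookkeeping might be sketched as follows. We assume each vector has already been packed into a single integer in the range 0 ≤ v < m, for example by reading its components as base-d digits; both helper names are hypothetical.

```python
def pack_vector(ys, d):
    """Read the components (each 0 <= y < d) as the digits of one base-d integer."""
    v = 0
    for y in ys:
        v = v * d + y
    return v

def count_collisions(values, m):
    """Count collisions among integers 0 <= v < m, using one bit per urn."""
    table = bytearray((m + 7) // 8)       # all bits initially zero
    collisions = 0
    for v in values:
        byte, bit = divmod(v, 8)
        mask = 1 << bit
        if table[byte] & mask:            # urn already occupied
            collisions += 1
        else:
            table[byte] |= mask
    return collisions
```

With m = 2**20 the table occupies 2**17 bytes, which matches the 2**15 32-bit words mentioned in the text.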

To decide if the test is passed, we can use the following table of percentage 
points when $m = 2^{20}$ and $n = 2^{14}$: 


collisions ≤       101   108   119   126   134   145   153 
with probability  .009  .043  .244  .476  .742  .946  .989 


The theory underlying these probabilities is the same we used in the poker test, 
Eq. (5); the probability that c collisions occur is the probability that n − c urns 
are occupied, namely 

$$\frac{m(m-1)\ldots(m-n+c+1)}{m^n}{n \brace n-c}.$$

Although m and n are very large, it is not difficult to compute these probabilities 
using the following method: 


Algorithm S (Percentage points for collision test). Given m and n, this 
algorithm determines the distribution of the number of collisions that occur 
when n balls are scattered into m urns. An auxiliary array A[0], A[1], ..., 
A[n] of floating point numbers is used for the computation; actually A[j] will be 
nonzero only for $j_0 \le j \le j_1$, and $j_1 - j_0$ will be at most of order log n, so it 
would be possible to get by with considerably less storage. 

S1. [Initialize.] Set $A[j] \leftarrow 0$ for $0 \le j \le n$; then set $A[1] \leftarrow 1$ and $j_0 \leftarrow j_1 \leftarrow 1$. 
Then do step S2 exactly n − 1 times and go on to step S3. 

S2. [Update probabilities.] (Performing this step once corresponds to tossing a 
ball into an urn; A[j] represents the probability that exactly j of the urns are 
occupied.) Set $j_1 \leftarrow j_1 + 1$. Then for $j \leftarrow j_1, j_1 - 1, \ldots, j_0$ (in this order), 
set $A[j] \leftarrow (j/m)A[j] + ((1 + 1/m) - (j/m))A[j-1]$. If A[j] has become 
very small as a result of this calculation, say $A[j] < 10^{-20}$, set $A[j] \leftarrow 0$; 
and in such a case, decrease $j_1$ by 1 if $j = j_1$, or increase $j_0$ by 1 if $j = j_0$. 

S3. [Compute the answers.] In this step we make use of an auxiliary table 
$(T_1, T_2, \ldots, T_{\rm tmax}) = (.01, .05, .25, .50, .75, .95, .99, 1.00)$ containing the 
specified percentage points of interest. Set $p \leftarrow 0$, $t \leftarrow 1$, and $j \leftarrow j_0 - 1$. Do 
the following iteration until t = tmax: Increase j by 1, and set $p \leftarrow p + A[j]$; 
then if $p > T_t$, output n − j − 1 and 1 − p (meaning that with probability 
1 − p there are at most n − j − 1 collisions) and repeatedly increase t by 1 
until $p \le T_t$. ∎ 
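The three steps translate fairly directly into Python. The following is a sketch rather than a tuned implementation: it keeps the full array A[0..n] instead of a sliding window, assumes n ≥ 1, and adds two small guards in step S3 (bounds on t and j) to protect against floating point rounding. The function names are our own.

```python
def occupancy_distribution(m, n, eps=1e-20):
    """Steps S1-S2: A[j] = probability that exactly j urns are occupied after n balls."""
    a = [0.0] * (n + 1)
    a[1] = 1.0
    j0 = j1 = 1
    for _ in range(n - 1):                           # toss one more ball each time
        j1 += 1
        for j in range(j1, j0 - 1, -1):              # descending, so a[j-1] is still old
            a[j] = (j / m) * a[j] + ((1 + 1 / m) - (j / m)) * a[j - 1]
            if a[j] < eps:                           # drop negligible probabilities
                a[j] = 0.0
                if j == j1:
                    j1 -= 1
                elif j == j0:
                    j0 += 1
    return a, j0, j1

def percentage_points(m, n, thresholds=(.01, .05, .25, .50, .75, .95, .99, 1.00)):
    """Step S3: pairs (c, prob) meaning 'at most c collisions with probability prob'."""
    a, j0, j1 = occupancy_distribution(m, n)
    results = []
    p, t, j = 0.0, 0, j0 - 1
    last = len(thresholds) - 1
    while t < last and j < j1:
        j += 1
        p += a[j]
        if p > thresholds[t]:
            results.append((n - j - 1, 1 - p))
            while t < last and p > thresholds[t]:
                t += 1
    return results
```

A call such as `percentage_points(2**20, 2**14)` performs roughly n times the window width floating point updates, so it may take a few seconds in pure Python.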


J. Birthday spacings test. George Marsaglia introduced a new kind of test in 
1984: We throw n balls into m urns, as in the collision test, but now we think of 
the urns as “days of a year” and the balls as “birthdays.” Suppose the birthdays 
are $(Y_1, \ldots, Y_n)$, where $0 \le Y_k < m$. Sort them into nondecreasing order 
$Y_{(1)} \le \cdots \le Y_{(n)}$; then define n “spacings” $S_1 = Y_{(2)} - Y_{(1)}$, ..., $S_{n-1} = Y_{(n)} - Y_{(n-1)}$, 
$S_n = Y_{(1)} + m - Y_{(n)}$; finally sort the spacings into order, $S_{(1)} \le \cdots \le S_{(n)}$. Let 
R be the number of equal spacings, namely the number of indices j such that 
$1 < j \le n$ and $S_{(j)} = S_{(j-1)}$. When $m = 2^{25}$ and n = 512, we should have 


R =                       0           1           2    3 or more 
with probability  .368801577  .369035243  .183471182  .078691997 


(The average number of equal spacings for this choice of m and n should be 
approximately 1.) Repeat the test 1000 times, say, and do a chi-square test with 
3 degrees of freedom to compare the empirical R’s with the correct distribution; 
this will tell whether or not the generator produces reasonably random birthday 
spacings. Exercises 28-30 develop the theory behind this test and formulas for 
other values of m and n. 
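The statistic R for a single batch of birthdays can be computed with two sorts. A minimal sketch, with an illustrative function name:

```python
def birthday_spacings_R(birthdays, m):
    """Number of equal adjacent values among the sorted cyclic spacings."""
    y = sorted(birthdays)
    n = len(y)
    spacings = [y[k + 1] - y[k] for k in range(n - 1)]
    spacings.append(y[0] + m - y[-1])        # wrap-around spacing S_n
    spacings.sort()
    return sum(1 for a, b in zip(spacings, spacings[1:]) if a == b)
```

Repeating this on, say, 1000 independent batches and tallying how often R is 0, 1, 2, or at least 3 gives the four observed counts for the chi-square comparison.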
Such a test of birthday spacings is important primarily because of the 
remarkable fact that lagged Fibonacci generators consistently fail it, although 
they pass the other traditional tests quite nicely. [Dramatic examples of such 
failures were reported by Marsaglia, Zaman, and Tsang in Stat. and Prob. Letters 
9 (1990), 35-39.] Consider, for example, the sequence 

$$X_n = (X_{n-24} + X_{n-55}) \bmod m$$

of Eq. 3.2.2-(7). The numbers of this sequence satisfy 

$$X_n + X_{n-86} \equiv X_{n-24} + X_{n-31} \pmod{m}$$

because both sides are congruent to $X_{n-24} + X_{n-55} + X_{n-86}$. Therefore two 
pairs of differences are equal: 

$$X_n - X_{n-24} \equiv X_{n-31} - X_{n-86},$$

and 

$$X_n - X_{n-31} \equiv X_{n-24} - X_{n-86}.$$

Whenever $X_n$ is reasonably close to $X_{n-24}$ or $X_{n-31}$ (as it should be in a truly 
random sequence), the difference has a good chance of showing up in two of 
the spacings. So we get significantly more cases of equality; typically $R \approx 2$ 
on the average, not 1. But if we discount from R any equal spacings that 
arise from the stated congruence, the resulting statistic $R'$ usually does pass 
the birthday test. (One way to avoid failure is to discard certain elements of 
the sequence, using for example only $X_0, X_2, X_4, \ldots$ as random numbers; then 
we never get all four elements of the set $\{X_n, X_{n-24}, X_{n-31}, X_{n-86}\}$, and the 
birthday spacings are no problem. An even better way to avoid the problem 
is to discard consecutive batches of numbers, as suggested by Lüscher; see 
Section 3.2.2.) Similar remarks apply to the subtract-with-borrow and 
add-with-carry generators of exercise 3.2.1.1-14. 
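The congruence behind this failure is easy to confirm numerically. The sketch below generates the lagged Fibonacci sequence with lags 24 and 55 from 55 arbitrary starting values and checks the four-term relation; the helper names and the choice of seed values are ours.

```python
def lagged_fibonacci(seed_values, m, count):
    """Extend 55 starting values by X_n = (X_{n-24} + X_{n-55}) mod m."""
    x = list(seed_values)
    while len(x) < count:
        x.append((x[-24] + x[-55]) % m)
    return x

def identity_holds(x, m):
    """Check X_n + X_{n-86} == X_{n-24} + X_{n-31} (mod m) wherever all terms exist."""
    return all((x[n] + x[n - 86] - x[n - 24] - x[n - 31]) % m == 0
               for n in range(86, len(x)))
```

The relation holds for every n ≥ 86 regardless of the starting values, which is why the equal spacings show up so reliably.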


K. Serial correlation test. We may also compute the following statistic: 


$$C = \frac{n(U_0U_1 + U_1U_2 + \cdots + U_{n-2}U_{n-1} + U_{n-1}U_0) - (U_0 + U_1 + \cdots + U_{n-1})^2}{n(U_0^2 + U_1^2 + \cdots + U_{n-1}^2) - (U_0 + U_1 + \cdots + U_{n-1})^2}. \tag{23}$$

This is the “serial correlation coefficient,” a measure of the extent to which $U_{j+1}$ 
depends on $U_j$. 

Correlation coefficients appear frequently in statistical work. If we have n 
quantities $U_0, U_1, \ldots, U_{n-1}$ and n others $V_0, V_1, \ldots, V_{n-1}$, the correlation 
coefficient between them is defined to be 

$$C = \frac{n\sum(U_jV_j) - (\sum U_j)(\sum V_j)}{\sqrt{\bigl(n\sum U_j^2 - (\sum U_j)^2\bigr)\bigl(n\sum V_j^2 - (\sum V_j)^2\bigr)}}. \tag{24}$$

All summations in this formula are to be taken over the range $0 \le j < n$; 
Eq. (23) is the special case $V_j = U_{(j+1) \bmod n}$. The denominator of (24) is zero 
when $U_0 = U_1 = \cdots = U_{n-1}$ or $V_0 = V_1 = \cdots = V_{n-1}$; we exclude that case 
from discussion. 



A correlation coefficient always lies between −1 and +1. When it is zero or 
very small, it indicates that the quantities $U_j$ and $V_j$ are (relatively speaking) 
independent of each other, whereas a value of ±1 indicates total linear dependence. 
In fact, $V_j = \alpha \pm \beta U_j$ for all j in the latter case, for some constants α 
and β. (See exercise 17.) 

Therefore it is desirable to have C in Eq. (23) close to zero. In actual 
fact, since $U_0U_1$ is not completely independent of $U_1U_2$, the serial correlation 
coefficient is not expected to be exactly zero. (See exercise 18.) A “good” value 
of C will be between $\mu_n - 2\sigma_n$ and $\mu_n + 2\sigma_n$, where 

$$\mu_n = \frac{-1}{n-1}, \qquad \sigma_n^2 = \frac{n^2}{(n-1)^2(n-2)}, \qquad n > 2. \tag{25}$$

We expect C to be between these limits about 95 percent of the time. 

The formula for $\sigma_n^2$ in (25) is an upper bound, valid for serial correlations 
between independent random variables from an arbitrary distribution. When 
the U’s are uniformly distributed, the true variance is obtained by subtracting 
$\frac{24}{5}n^{-2} + O(n^{-7/3}\log n)$. (See exercise 20.) 
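For concreteness, the serial correlation coefficient and the acceptance interval can be sketched as below. The function names are ours, the denominator is assumed to be nonzero, and `good_range` assumes n > 2.

```python
def serial_correlation(u):
    """Serial correlation coefficient of Eq. (23), with cyclic wrap-around."""
    n = len(u)
    s1 = sum(u)
    s2 = sum(x * x for x in u)
    cross = sum(u[j] * u[(j + 1) % n] for j in range(n))
    return (n * cross - s1 * s1) / (n * s2 - s1 * s1)

def good_range(n):
    """The interval (mu_n - 2 sigma_n, mu_n + 2 sigma_n) of Eq. (25), for n > 2."""
    mu = -1 / (n - 1)
    sigma = (n * n / ((n - 1) ** 2 * (n - 2))) ** 0.5
    return mu - 2 * sigma, mu + 2 * sigma
```

A generator would be considered to pass on a given sample when `serial_correlation(u)` lies inside `good_range(len(u))`; the interval uses the upper bound on the variance, so it is slightly generous for uniform deviates.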

Instead of simply computing the correlation coefficient between the observations 
$(U_0, U_1, \ldots, U_{n-1})$ and their immediate successors $(U_1, \ldots, U_{n-1}, U_0)$, 
we can also compute it between $(U_0, U_1, \ldots, U_{n-1})$ and any cyclically shifted 
sequence $(U_q, \ldots, U_{n-1}, U_0, \ldots, U_{q-1})$; the cyclic correlations should be small 
for 0 < q < n. A straightforward computation of Eq. (24) for all q would 
require about $n^2$ multiplications, but it is actually possible to compute all the 
correlations in only O(n log n) steps by using “fast Fourier transforms.” (See 
Section 4.6.4; see also L. P. Schmid, CACM 8 (1965), 115.) 


L. Tests on subsequences. External programs often call for random numbers 
in batches. For example, if a program works with three random variables X, Y, 
and Z, it may consistently invoke the generation of three random numbers at a 
time. In such applications it is important that the subsequences consisting of 
every third term of the original sequence be random. If the program requires 
q numbers at a time, the sequences 


$$U_0, U_q, U_{2q}, \ldots; \quad U_1, U_{q+1}, U_{2q+1}, \ldots; \quad \ldots; \quad U_{q-1}, U_{2q-1}, U_{3q-1}, \ldots$$


can each be put through the tests described above for the original sequence $U_0$, 
$U_1$, $U_2$, .... 
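In a language with slicing, the q derived sequences can be extracted in one line each. A brief sketch (the function name is illustrative):

```python
def subsequences(u, q):
    """Split u into the q sequences U_i, U_{i+q}, U_{i+2q}, ... for i = 0..q-1."""
    return [u[i::q] for i in range(q)]
```

Each of the returned lists can then be fed to any of the tests sketched earlier in place of the original sequence.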

Experience with linear congruential sequences has shown that these derived 
sequences rarely if ever behave less randomly than the original sequence, unless q 
has a large factor in common with the period length. On a binary computer with 
m equal to the word size, for example, a test of the subsequences for q = 8 will 
tend to give the poorest randomness for all q < 16; and on a decimal computer, 
q = 10 yields the subsequences most likely to be unsatisfactory. (This can be 
explained somewhat on the grounds of potency, since such values of q will tend 
to lower the potency. Exercise 3.2.1.2—20 provides a more detailed explanation.) 
M. Historical remarks and further discussion. Statistical tests arose 
naturally in the course of scientists’ efforts to “prove” or “disprove” hypotheses 
about various observed data. The best-known early papers dealing with the 
testing of artificially generated numbers for randomness are two articles by M. G. 
Kendall and B. Babington-Smith in the Journal of the Royal Statistical Society 
101 (1938), 147-166, and in the supplement to that journal, 6 (1939), 51-61. 
Those papers were concerned with the testing of random digits between 0 and 9, 
rather than random real numbers; for this purpose, the authors discussed the 
frequency test, serial test, gap test, and poker test, although they misapplied 
the serial test. Kendall and Babington-Smith also used a variant of the coupon 
collector’s test; the method described in this section was introduced by R. E. 
Greenwood in Math. Comp. 9 (1955), 1-5. 

The run test has a rather interesting history. Originally, tests were made 
on runs up and down at once: A run up would be followed by a run down, then 
another run up, and so on. Note that the run test and the permutation test 
do not depend on the uniform distribution of the U’s, but only on the fact that 
$U_i = U_j$ occurs with probability zero when $i \ne j$; therefore these tests can be 
applied to many types of random sequences. The run test in primitive form was 
originated by J. Bienaymé [Comptes Rendus Acad. Sci. 81 (Paris, 1875), 417- 
423]. Some sixty years later, W. O. Kermack and A. G. McKendrick published 
two extensive papers on the subject [Proc. Royal Society Edinburgh 57 (1937), 
228-240, 332-376]; as an example they stated that Edinburgh rainfall between 
the years 1785 and 1930 was “entirely random in character” with respect to the 
run test (although they examined only the mean and standard deviation of the 
run lengths). Several other people began using the test, but it was not until 
1944 that the use of the chi-square method in connection with this test was 
shown to be incorrect. A paper by H. Levene and J. Wolfowitz in Annals Math. 
Stat. 15 (1944), 58-69, introduced the correct run test (for runs up and down, 
alternately) and discussed the fallacies in earlier misuses of that test. Separate 
tests for runs up and runs down, as proposed in the text above, are more suited 
to computer application, so we have not given the more complex formulas for 
the alternate-up-and-down case. See the survey paper by D. E. Barton and C. L. 
Mallows, Annals Math. Stat. 36 (1965), 236-260. 

Of all the tests we have discussed, the frequency test and the serial corre- 
lation test seem to be the weakest, in the sense that nearly all random number 
generators pass them. Theoretical grounds for the weakness of these tests are 
discussed briefly in Section 3.5 (see exercise 3.5-26). The run test, on the other 
hand, is rather strong: The results of exercises 3.3.3-23 and 24 suggest that 
linear congruential generators tend to have runs somewhat longer than normal 
if the multiplier is not large enough, so the run test of exercise 14 is definitely 
to be recommended. 

The collision test is also highly recommended, since it has been specially 
designed to detect the deficiencies of many poor generators that have unfortu- 
nately become widespread. Based on ideas of H. Delgas Christiansen [Inst. Math. 
Stat. and Oper. Res., Tech. Univ. Denmark (October 1975), unpublished], this 
test was the first to be developed after the advent of computers; it is specifically 
intended for computer use, and unsuitable for hand calculation. 


The reader probably wonders, “Why are there so many tests?” It has been 
said that more computer time is spent testing random numbers than using them 
in applications! This is untrue, although it is possible to go overboard in testing. 

The need for making several tests has been amply documented. People have 
found, for example, that some numbers generated by a variant of the middle- 
square method have passed the frequency test, gap test, and poker test, yet 
flunked the serial test. Linear congruential sequences with small multipliers have 
been known to pass many tests, yet fail on the run test because there are too 
few runs of length one. The maximum-of-t test has also been used to ferret out 
some bad generators that otherwise seemed to perform respectably. A subtract- 
with-borrow generator fails the gap test when the maximum gap length exceeds 
the largest lag; see Vattulainen, Kankaala, Saarinen, and Ala-Nissila, Computer 
Physics Communications 86 (1995), 209-226, where a variety of other tests are 
also reported. Lagged Fibonacci generators, which are theoretically guaranteed 
to have equally distributed least-significant bits, still fail some simple variants of 
the 1-bit equidistribution test (see exercises 31 and 35, also 3.6-14). 

Perhaps the main reason for doing extensive testing on random number 
generators is that people misusing Mr. X’s random number generator will hardly 
ever admit that their programs are at fault: They will blame the generator, until 
Mr. X can prove to them that his numbers are sufficiently random. On the other 
hand, if the source of random numbers is only for Mr. X’s personal use, he might 
decide not to bother to test them, since the techniques recommended in this 
chapter have a high probability of being satisfactory. 

As computers become faster, more random numbers are consumed than ever 
before, and random number generators that once were satisfactory are no longer 
good enough for sophisticated applications in physics, combinatorics, stochastic 
geometry, etc. George Marsaglia has therefore introduced a number of stringent 
tests, which go well beyond classical methods like the gap and poker tests, in 
order to meet the new challenges. For example, he found that the sequence 
$X_{n+1} = (62605X_n + 113218009) \bmod 2^{29}$ had a noticeable bias in the following 
experiment: Generate $2^{21}$ random numbers $X_n$ and extract their 10 leading bits 
$Y_n = \lfloor X_n/2^{19} \rfloor$. Count how many of the $2^{20}$ possible pairs $(y, y')$ of 10-bit 
numbers do not occur among $(Y_1, Y_2), (Y_2, Y_3), \ldots, (Y_{2^{21}-1}, Y_{2^{21}})$. There ought 
to be about 141909.33 missing pairs, with standard deviation ≈ 290.46 (see 
exercise 34). But six consecutive trials, starting with $X_1 = 1234567$, produced 
counts that were all between 1.5 and 3.5 standard deviations too low. The 
distribution was a bit too “flat” to be random, probably because $2^{21}$ numbers 
is a significant fraction, 1/256, of the entire period. A similar generator with 
multiplier 69069 and modulus $2^{30}$ proved to be better. Marsaglia and Zaman call 
this procedure a “monkey test,” because it counts the number of two-character 
combinations that a monkey will miss after typing randomly on a keyboard 
with 1024 keys; see Computers and Math. 26,9 (November 1993), 1-10, for the 
analysis of several monkey tests. 
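The experiment just described can be sketched as follows, assuming a linear congruential generator whose modulus is a power of two. The function and parameter names are ours, and the sketch only counts the missing pairs; it does not evaluate the expected mean or standard deviation.

```python
def count_missing_pairs(a, c, modulus_bits, x1, count, lead_bits=10):
    """Count pairs (y, y') of lead_bits-bit values that never occur consecutively."""
    shift = modulus_bits - lead_bits
    mask = (1 << modulus_bits) - 1
    seen = bytearray(1 << (2 * lead_bits))    # one flag per possible pair
    x = x1
    prev = x >> shift                         # leading bits of X_1
    for _ in range(count - 1):
        x = (a * x + c) & mask                # next X, reduced mod 2**modulus_bits
        y = x >> shift
        seen[(prev << lead_bits) | y] = 1
        prev = y
    return (1 << (2 * lead_bits)) - sum(seen)
```

A trial in the spirit of the text would be `count_missing_pairs(62605, 113218009, 29, 1234567, 2**21)`, whose result is then compared with the expected number of missing pairs.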
EXERCISES 
1. [10] Why should the serial test described in part B be applied to $(Y_0, Y_1)$, $(Y_2, Y_3)$, 
..., $(Y_{2n-2}, Y_{2n-1})$ instead of to $(Y_0, Y_1)$, $(Y_1, Y_2)$, ..., $(Y_{n-1}, Y_n)$? 


2. [10] State an appropriate way to generalize the serial test to triples, quadruples, 
etc., instead of pairs. 


3. [M20] How many U’s need to be examined in the gap test (Algorithm G) before 
n gaps have been found, on the average, assuming that the sequence is random? What 
is the standard deviation of this quantity? 


4. [M12] Prove that the probabilities in (4) are correct for the gap test. 


5. [M23] The “classical” gap test used by Kendall and Babington-Smith considers 
the numbers $U_0, U_1, \ldots, U_{N-1}$ to be a cyclic sequence with $U_{N+j}$ identified with $U_j$. 
Here N is a fixed number of U’s that are to be subjected to the test. If n of the numbers 
$U_0, \ldots, U_{N-1}$ fall into the range $\alpha \le U_j < \beta$, there are n gaps in the cyclic sequence. 
Let $Z_r$ be the number of gaps of length r, for $0 \le r < t$, and let $Z_t$ be the number of 
gaps of length ≥ t; show that the quantity $V = \sum_{0 \le r \le t}(Z_r - np_r)^2/np_r$ should have 
the chi-square distribution with t degrees of freedom, in the limit as N goes to infinity, 
where $p_r$ is given in Eq. (4). 


6. [40] (H. Geiringer.) A frequency count of the first 2000 decimal digits in the 
representation of e = 2.71828... gave a $\chi^2$ value of 1.06, indicating that the actual 
frequencies of the digits 0, 1, ..., 9 are much too close to their expected values to be 
considered randomly distributed. (In fact, $\chi^2 \ge 1.15$ with probability 99.9 percent.) 
The same test applied to the first 10,000 digits of e gives the reasonable value $\chi^2 = 8.61$; 
but the fact that the first 2000 digits are so evenly distributed is still surprising. Does 
the same phenomenon occur in the representation of e to other bases? [See AMM 72 
(1965), 483-500.] 


7. [08] Apply the coupon collector’s test procedure (Algorithm C), with d = 3 and 
n = 7, to the sequence 1101221022120202001212201010201121. What lengths do the 
seven subsequences have? 


8. [M22] How many U’s need to be examined in the coupon collector’s test, on the 
average, before n complete sets have been found by Algorithm C, assuming that the 
sequence is random? What is the standard deviation? [Hint: See Eq. 1.2.9-(28).] 


9. [M21] Generalize the coupon collector’s test so that the search stops as soon as 
w distinct values have been found, where w is a fixed positive integer less than or equal 
to d. What probabilities should be used in place of (6)? 


10. [M23] Solve exercise 8 for the more general coupon collector’s test described in 
exercise 9. 


11. [00] The “runs up” in a particular permutation are displayed in (9); what are the 
“runs down” in that permutation? 


12. [20] Let $U_0, U_1, \ldots, U_{n-1}$ be n distinct numbers. Write an algorithm that 
determines the lengths of all ascending runs in the sequence. When your algorithm 
terminates, COUNT[r] should be the number of runs of length r, for $1 \le r \le 5$, and 
COUNT[6] should be the number of runs of length 6 or more. 


13. [M23] Show that (16) is the number of permutations of p + q + 1 distinct elements 
having the pattern (15). 

▶ 14. [M15] If we “throw away” the element that immediately follows a run, so that 
when $X_j$ is greater than $X_{j+1}$ we start the next run with $X_{j+2}$, the run lengths are 
independent, and a simple chi-square test may be used (instead of the horribly 
complicated method derived in the text). What are the appropriate run-length probabilities 
for this simple run test? 


15. [M10] In the maximum-of-t test, why are $V_0^t, V_1^t, \ldots, V_{n-1}^t$ supposed to be 
uniformly distributed between zero and one? 


▶ 16. [15] Mr. J. H. Quick (a student) wanted to perform the maximum-of-t test for 
several different values of t. 

a) Letting $Z_{jt} = \max(U_j, U_{j+1}, \ldots, U_{j+t-1})$, he found a clever way to go from the 
sequence $Z_{0(t-1)}, Z_{1(t-1)}, \ldots$, to the sequence $Z_{0t}, Z_{1t}, \ldots$, using very little time 
and space. What was his bright idea? 

b) He decided to modify the maximum-of-t method so that the jth observation would 
be $\max(U_j, \ldots, U_{j+t-1})$; in other words, he took $V_j = Z_{jt}$ instead of $V_j = Z_{(tj)t}$ as 
the text says. He reasoned that all of the Z’s should have the same distribution, 
so the test is even stronger if each $Z_{jt}$, $0 \le j < n$, is used instead of just every tth 
one. But when he tried a chi-square equidistribution test on the values of $V_j^t$, he 
got extremely high values of the statistic V, which got even higher as t increased. 
Why did this happen? 


17. [M25] Given any numbers $U_0, \ldots, U_{n-1}, V_0, \ldots, V_{n-1}$, let their mean values be 

$$\bar{u} = \frac{1}{n}\sum_{0 \le k < n} U_k, \qquad \bar{v} = \frac{1}{n}\sum_{0 \le k < n} V_k.$$

a) Let $U'_k = U_k - \bar{u}$, $V'_k = V_k - \bar{v}$. Show that the correlation coefficient C given in 
Eq. (24) is equal to 

$$\sum_{0 \le k < n} U'_k V'_k \Bigg/ \sqrt{\sum_{0 \le k < n} U_k'^2}\,\sqrt{\sum_{0 \le k < n} V_k'^2}.$$

b) Let C = N/D, where N and D denote the numerator and denominator of the 
expression in part (a). Show that $N^2 \le D^2$, hence $-1 \le C \le 1$; and obtain a 
formula for the difference $D^2 - N^2$. [Hint: See exercise 1.2.3-30.] 

c) If C = ±1, show that $\alpha U_k + \beta V_k = \tau$, $0 \le k < n$, for some constants α, β, and τ, 
not all zero. 


18. [M20] (a) Show that if n = 2, the serial correlation coefficient (23) is always equal
to −1 (unless the denominator is zero). (b) Similarly, show that when n = 3, the serial
correlation coefficient always equals −1/2. (c) Show that the denominator in (23) is zero
if and only if $U_0 = U_1 = \cdots = U_{n-1}$.

19. [M30] (J. P. Butler.) Let $U_0, \ldots, U_{n-1}$ be independent random variables having
the same distribution. Prove that the expected value of the serial correlation
coefficient (23), averaged over all cases with nonzero denominator, is −1/(n − 1).


20. [HM41] Continuing the previous exercise, prove that the variance of (23) is equal
to $n^2/(n-1)^2(n-2) - n^3\,E\bigl((U_0-U_1)^4/D^2\bigr)/2(n-2)$, where D is the denominator of (23)
and E denotes the expected value over all cases with $D \ne 0$. What is the asymptotic
value of $E\bigl((U_0-U_1)^4/D^2\bigr)$ when each $U_j$ is uniformly distributed?

21. [19] What value of f is computed by Algorithm P if it is presented with the
permutation (1, 2, 9, 8, 5, 3, 6, 7, 0, 4)?

22. [18] For what permutation of {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} will Algorithm P produce
the value f = 1024?


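23. [M22] Let $\langle Y_n\rangle$ and $\langle Y'_n\rangle$ be integer sequences having period lengths λ and λ′,
respectively, with $0 \le Y_n, Y'_n < d$; also let $Z_n = (Y_n + Y'_{n+r}) \bmod d$, where r is chosen
at random between 0 and λ′ − 1. Show that $\langle Z_n\rangle$ passes the t-dimensional serial test at
least as well as $\langle Y_n\rangle$ does, in the following sense: Let $P(x_1,\ldots,x_t)$ and $Q(x_1,\ldots,x_t)$
be the probabilities that the t-tuple $(x_1,\ldots,x_t)$ occurs in $\langle Y_n\rangle$ and $\langle Z_n\rangle$:

$$P(x_1,\ldots,x_t) = \frac{1}{\lambda}\sum_{n=0}^{\lambda-1} \bigl[(Y_n,\ldots,Y_{n+t-1}) = (x_1,\ldots,x_t)\bigr];$$

$$Q(x_1,\ldots,x_t) = \frac{1}{\lambda\lambda'}\sum_{n=0}^{\lambda-1}\sum_{r=0}^{\lambda'-1} \bigl[(Z_n,\ldots,Z_{n+t-1}) = (x_1,\ldots,x_t)\bigr].$$

Then

$$\sum_{(x_1,\ldots,x_t)} \bigl(Q(x_1,\ldots,x_t) - d^{-t}\bigr)^2 \;\le\; \sum_{(x_1,\ldots,x_t)} \bigl(P(x_1,\ldots,x_t) - d^{-t}\bigr)^2.$$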

24. [HM37] (G. Marsaglia.) Show that the serial test on n overlapping t-tuples
$(Y_1, Y_2, \ldots, Y_t)$, $(Y_2, Y_3, \ldots, Y_{t+1})$, ..., $(Y_n, Y_1, \ldots, Y_{t-1})$ can be carried out as follows:
For each string $\alpha = a_1 \ldots a_m$ with $0 \le a_i < d$, let N(α) be the number of times
α occurs as a substring of $Y_1 Y_2 \ldots Y_n Y_1 \ldots Y_{m-1}$, and let $P(\alpha) = P(a_1)\ldots P(a_m)$ be
the probability that α occurs at any given position; individual digits may occur with
differing probabilities P(0), P(1), ..., P(d − 1). Compute the statistic

$$V = \frac1n \sum_{|\alpha|=t} \frac{N(\alpha)^2}{P(\alpha)} \;-\; \frac1n \sum_{|\alpha|=t-1} \frac{N(\alpha)^2}{P(\alpha)}.$$

Then V should have the chi-square distribution with $d^t - d^{t-1}$ degrees of freedom when
n is large. [Hint: Use exercise 3.3.1–25.]

25. [M46] Why is $C_1^{-1} C_2\, C_1^{-1} \approx -6\,C_1^{-1}$, when $C_1$ and $C_2$ are the matrices defined
after (22)?


26. [HM30] Let $U_1, U_2, \ldots, U_n$ be independent uniform deviates in [0 .. 1), and let
$U_{(1)} \le U_{(2)} \le \cdots \le U_{(n)}$ be their values after sorting; also define the spacings
$S_1 = U_{(2)} - U_{(1)}$, ..., $S_{n-1} = U_{(n)} - U_{(n-1)}$, $S_n = U_{(1)} + 1 - U_{(n)}$ and sorted spacings
$S_{(1)} \le \cdots \le S_{(n)}$ as in the birthday spacings test. It is convenient in the following
calculations to use the notation $x_+^n$ as an abbreviation for the expression $x^n[x \ge 0]$.

a) Given any real numbers $s_1, s_2, \ldots, s_n$, prove that the simultaneous inequalities
$S_1 \ge s_1$, $S_2 \ge s_2$, ..., $S_n \ge s_n$ occur with probability $(1 - s_{1+} - s_{2+} - \cdots - s_{n+})_+^{\,n-1}$.

b) Consequently the smallest spacing $S_{(1)}$ is $\le s$ with probability $1 - (1 - ns)_+^{\,n-1}$.

c) What are the distribution functions $F_k(s) = \Pr(S_{(k)} \le s)$, for $1 \le k \le n$?

d) Calculate the mean and variance of each $S_{(k)}$.


27. [HM26] (Iterated spacings.) In the notation of the previous exercise, show that
the numbers $S'_1 = nS_{(1)}$, $S'_2 = (n-1)(S_{(2)} - S_{(1)})$, ..., $S'_n = 1\,(S_{(n)} - S_{(n-1)})$ have
the same joint probability distribution as the original spacings $S_1, \ldots, S_n$ of n uniform
deviates. Therefore we can sort them into order, $S'_{(1)} \le \cdots \le S'_{(n)}$, and repeat this
transformation to get yet another set of random spacings $S''_1, \ldots, S''_n$, etc. Each
successive set of spacings $S_1^{(k)}, \ldots, S_n^{(k)}$ can be subjected to the Kolmogorov–Smirnov
test, using

$$K_{n-1}^{+} = \sqrt{n-1}\,\max_{1\le j<n}\Bigl(\frac{j}{n-1} - S_1^{(k)} - \cdots - S_j^{(k)}\Bigr),$$

$$K_{n-1}^{-} = \sqrt{n-1}\,\max_{1\le j<n}\Bigl(S_1^{(k)} + \cdots + S_j^{(k)} - \frac{j-1}{n-1}\Bigr).$$

Examine the transformation from $(S_1, \ldots, S_n)$ to $(S'_1, \ldots, S'_n)$ in detail in the cases
n = 2 and n = 3; explain why continued repetition of this process will break down
eventually when it is applied to computer-generated numbers with finite precision.
(One way to compare random number generators is to see how long they can continue
to survive such a torture test.)


28. [M26] Let $b_{nrs}(m)$ be the number of n-tuples $(y_1, \ldots, y_n)$ with $0 \le y_j < m$ that
have exactly r equal spacings and s zero spacings. Thus, the probability that R = r
in the birthday spacings test is $\sum_{s=0}^{r+1} b_{nrs}(m)/m^n$. Also let $p_n(m)$ be the number of
partitions of m into at most n parts (exercise 5.1.1–15). (a) Express $b_{n00}(m)$ in terms
of partitions. [Hint: Consider cases with small m and n.] (b) Show that there is a
simple relation between $b_{nrs}(m)$ and $b_{(n-s)(r+1-s)0}(m)$ when s > 0. (c) Deduce an
explicit formula for the probability that no spacings are equal.

29. [M35] Continuing exercise 28, find simple expressions for the generating functions
$b_{nr}(z) = \sum_{m\ge 0} b_{nr0}(m)\,z^m/m$, when r = 0, 1, and 2.


30. [HM41] Continuing the previous exercises, prove that if $m = n^3/\alpha$ we have

$$p_n(m) = \frac{m^{n-1}e^{\alpha/4}}{n!\,(n-1)!}\left(1 - \frac{13\alpha^2}{288n} + \frac{169\alpha^4 + 2016\alpha^3 - 1728\alpha^2 - 41472\alpha}{165888n^2} + O(n^{-3})\right)$$

for fixed α as $n \to \infty$. Find a similar formula for $q_n(m)$, the number of partitions of m
into n distinct positive parts. Deduce the asymptotic probabilities that the birthday
spacings test finds R equal to 0, 1, and 2, to within O(1/n).


31. [M21] The recurrence $Y_n = (Y_{n-24} + Y_{n-55}) \bmod 2$, which describes the least
significant bits of the lagged Fibonacci generator 3.2.2–(7) as well as the second-least
significant bits of 3.2.2–(7′), is known to have period length $2^{55} - 1$; hence every possible
nonzero pattern of bits $(Y_n, Y_{n+1}, \ldots, Y_{n+54})$ occurs equally often. Nevertheless, prove
that if we generate 79 consecutive random bits $Y_n, \ldots, Y_{n+78}$ starting at a random point
in the period, the probability is more than 51% that there are more 1s than 0s. If we use
such bits to define a “random walk” that moves to the right when the bit is 1 and to the
left when the bit is 0, we’ll finish to the right of our starting point significantly more than
half of the time. [Hint: Find the generating function $\sum_k \Pr(Y_n + \cdots + Y_{n+78} = k)\,z^k$.]

32. [M20] True or false: If X and Y are independent, identically distributed random
variables with mean 0, and if they are more likely to be positive than negative, then
X + Y is more likely to be positive than negative.

33. [HM32] Find the asymptotic value of the probability that k + l consecutive bits
generated by the recurrence $Y_n = (Y_{n-l} + Y_{n-k}) \bmod 2$ have more 1s than 0s, when
k > 2l and the period length of this recurrence is $2^k - 1$, assuming that k is large.

34. [HM29] Explain how to estimate the mean and variance of the number of
two-letter combinations that do not occur consecutively in a random string of length n
on an m-letter alphabet. Assume that m is large and $n \approx 2m^2$.

35. [HM32] (J. H. Lindholm, 1968.) Suppose we generate random bits $\langle Y_n\rangle$ using the
recurrence

$$Y_n = (a_1 Y_{n-1} + a_2 Y_{n-2} + \cdots + a_k Y_{n-k}) \bmod 2,$$

for some choice of $a_1, \ldots, a_k$ such that the period length is $2^k - 1$; start with $Y_0 = 1$
and $Y_1 = \cdots = Y_{k-1} = 0$. Let $Z_n = (-1)^{Y_n+1} = 2Y_n - 1$ be a random sign, and
consider the statistic $S_m = Z_n + Z_{n+1} + \cdots + Z_{n+m-1}$, where n is a random point in
the period.

a) Prove that $E\,S_m = m/N$, where $N = 2^k - 1$.

b) What is $E\,S_m^2$? Assume that $m \le N$. Hint: See exercise 3.2.2–16.

c) What would $E\,S_m$ and $E\,S_m^2$ be if the Z’s were truly random?

d) Assuming that $m \le N$, prove that $E\,S_m^3 = m^3/N - 6B(N+1)/N$, where

$$B = \sum_{0<i<j<m} \bigl[(Y_{i+1}Y_{i+2}\ldots Y_{i+k-1})_2 = (Y_{j+1}Y_{j+2}\ldots Y_{j+k-1})_2\bigr]\,(m - j).$$

e) Evaluate B in the special case considered in exercise 31: m = 79 and
$Y_n = (Y_{n-24} + Y_{n-55}) \bmod 2$.


*3.3.3. Theoretical Tests 


Although it is always possible to test a random number generator using the 
methods in the previous section, it is far better to have a priori tests: theoretical 
results that tell us in advance how well those tests will come out. Such theoretical 
results give us much more understanding about the generation methods than 
empirical, trial-and-error results do. In this section we shall study the linear 
congruential sequences in more detail; if we know what the results of certain 
tests will be before we actually generate the numbers, we have a better chance 
of choosing a, m, and c properly. 

The development of this kind of theory is quite difficult, although some 
progress has been made. The results obtained so far are generally for statistical 
tests made over the entire period. Not all statistical tests make sense when they 
are applied over a full period — for example, the equidistribution test will give 
results that are too perfect — but the serial test, gap test, permutation test, 
maximum test, etc., can be fruitfully analyzed in this way. Such studies will 
detect global nonrandomness of a sequence, that is, improper behavior in very 
large samples. 

The theory we shall discuss is quite illuminating, but it does not eliminate 
the need for testing local nonrandomness by the methods of Section 3.3.2. Indeed, 
the task of proving anything useful about short subsequences appears to be very 
hard. Only a few theoretical results are known about the behavior of linear 
congruential sequences over less than a full period; they will be discussed at the 
end of Section 3.3.4. (See also exercise 18.) 

Let us begin with a proof of a simple a priori law, for the least complicated 
case of the permutation test. The gist of our first theorem is that we have 
Xn+1 < Xn about half the time, provided that the sequence has high potency. 


Theorem P. Let a, c, and m generate a linear congruential sequence with
maximum period; let b = a − 1 and let d be the greatest common divisor of m
and b. The probability that $X_{n+1} < X_n$ is equal to $\frac12 + r$, where

$$r = \frac{2(c \bmod d) - d}{2m}; \tag{1}$$

hence $|r| < d/2m$.

Proof. The proof of this theorem involves some techniques that are of interest
in themselves. First we define

$$s(x) = (ax + c) \bmod m. \tag{2}$$

Thus, $X_{n+1} = s(X_n)$, and the theorem reduces to counting the number of
integers x such that $0 \le x < m$ and $s(x) < x$, since every such integer occurs
somewhere in the period. We want to show that this number is

$$\tfrac12\bigl(m + 2(c \bmod d) - d\bigr). \tag{3}$$

The function $\lceil (x - s(x))/m \rceil$ is equal to 1 when $x > s(x)$, and it is 0
otherwise; hence the count we wish to obtain can be written simply as

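$$\sum_{0\le x<m}\left\lceil\frac{x-s(x)}{m}\right\rceil
 = \sum_{0\le x<m}\left\lceil\frac{x}{m}-\left(\frac{ax+c}{m}-\left\lfloor\frac{ax+c}{m}\right\rfloor\right)\right\rceil
 = \sum_{0\le x<m}\left(\left\lfloor\frac{ax+c}{m}\right\rfloor-\left\lfloor\frac{bx+c}{m}\right\rfloor\right). \tag{4}$$

(Recall that $\lceil -y\rceil = -\lfloor y\rfloor$ and b = a − 1.) Such sums can be evaluated by the
method of exercise 1.2.4–37, where we have proved that

$$\sum_{0\le j<k}\left\lfloor\frac{hj+c}{k}\right\rfloor = \frac{(h-1)(k-1)}{2}+\frac{g-1}{2}+g\lfloor c/g\rfloor,\qquad g=\gcd(h,k), \tag{5}$$

whenever h and k are integers and k > 0. Since a is relatively prime to m, this
formula yields

$$\sum_{0\le x<m}\left\lfloor\frac{ax+c}{m}\right\rfloor = \frac{(a-1)(m-1)}{2}+c,$$

$$\sum_{0\le x<m}\left\lfloor\frac{bx+c}{m}\right\rfloor = \frac{(b-1)(m-1)}{2}+\frac{d-1}{2}+c-(c\bmod d),$$

and (3) follows immediately. ∎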
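
As a quick sanity check on Theorem P, the count in (3) can be compared with a direct enumeration over one full period. The sketch below is illustrative only: the function names and the sample parameters (powers of 2 with a ≡ 1 modulo 4 and c odd, plus one decimal modulus) are assumptions chosen for this example, not taken from the text.

```python
from math import gcd

def count_descents(a, c, m):
    """Count the x in [0, m) with s(x) < x, where s(x) = (a*x + c) mod m."""
    return sum(1 for x in range(m) if (a * x + c) % m < x)

def theorem_p_count(a, c, m):
    """Closed form (3): (m + 2*(c mod d) - d) / 2, with d = gcd(m, a - 1)."""
    d = gcd(m, a - 1)
    return (m + 2 * (c % d) - d) // 2

# Hypothetical maximum-period parameter sets, used only for illustration.
for a, c, m in [(5, 1, 8), (13, 5, 64), (129, 7, 1024), (21, 7, 1000)]:
    direct = count_descents(a, c, m)
    closed = theorem_p_count(a, c, m)
    print(f"a={a} c={c} m={m}: direct={direct} closed-form={closed} "
          f"probability={direct / m:.4f}")
```

Dividing either count by m gives the probability ½ + r of the theorem.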


The proof of Theorem P indicates that a priori tests can indeed be carried
out, provided that we are able to deal satisfactorily with sums involving the $\lfloor\;\rfloor$
and $\lceil\;\rceil$ functions. In many cases the most powerful technique for dealing with
floor and ceiling functions is to replace them by two somewhat more symmetrical
operations:

$$\delta(x) = \lfloor x\rfloor + 1 - \lceil x\rceil = [x \text{ is an integer}]; \tag{6}$$

$$((x)) = x - \lfloor x\rfloor - \tfrac12 + \tfrac12\delta(x) = x - \lceil x\rceil + \tfrac12 - \tfrac12\delta(x) = x - \tfrac12\bigl(\lfloor x\rfloor + \lceil x\rceil\bigr). \tag{7}$$

The latter function is a “sawtooth” function familiar in the study of Fourier
series; its graph is shown in Fig. 7. The reason for choosing to work with ((x))
rather than $\lfloor x\rfloor$ or $\lceil x\rceil$ is that ((x)) possesses several very useful properties:

Fig. 7. The sawtooth function ((x)).

$$((-x)) = -((x)); \tag{8}$$

$$((x+n)) = ((x)), \qquad \text{integer } n; \tag{9}$$

$$((nx)) = ((x)) + \Bigl(\!\Bigl(x + \frac1n\Bigr)\!\Bigr) + \cdots + \Bigl(\!\Bigl(x + \frac{n-1}{n}\Bigr)\!\Bigr), \qquad \text{integer } n \ge 1. \tag{10}$$

(See exercises 1.2.4–38 and 1.2.4–39(a,b,g).)
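
The definitions (6) and (7) and the properties (8), (9), (10) are easy to try out with exact rational arithmetic. The following is a minimal sketch; the helper names `delta` and `sawtooth` are assumptions made for this illustration.

```python
from fractions import Fraction
from math import floor, ceil

def delta(x):
    """delta(x) = [x is an integer], as in Eq. (6)."""
    return 1 if floor(x) == ceil(x) else 0

def sawtooth(x):
    """((x)) = x - (floor(x) + ceil(x))/2, the last form in Eq. (7)."""
    x = Fraction(x)
    return x - Fraction(floor(x) + ceil(x), 2)

x = Fraction(7, 10)
n = 4
print("((x))          =", sawtooth(x))
print("((-x))         =", sawtooth(-x))            # compare with (8)
print("((x + 3))      =", sawtooth(x + 3))         # compare with (9)
print("((n x))        =", sawtooth(n * x))         # compare with (10)
print("sum of shifts  =", sum(sawtooth(x + Fraction(j, n)) for j in range(n)))
```

Working with `Fraction` instead of floating point keeps the integer case δ(x) = 1 exact, which matters because ((x)) jumps there.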

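In order to get some practice working with these functions, let us prove
Theorem P again, this time without relying on exercise 1.2.4–37. With the help
of Eqs. (7), (8), (9), we can show that

$$\left\lceil\frac{x-s(x)}{m}\right\rceil = \frac{x-s(x)}{m} - \Bigl(\!\Bigl(\frac{x-s(x)}{m}\Bigr)\!\Bigr) + \frac12 = \frac{x-s(x)}{m} + \Bigl(\!\Bigl(\frac{bx+c}{m}\Bigr)\!\Bigr) + \frac12, \tag{11}$$

since (x − s(x))/m is never an integer. Now

$$\sum_{0\le x<m} \frac{x-s(x)}{m} = 0$$

since both x and s(x) take on each value of {0, 1, ..., m − 1} exactly once; hence
(11) yields

$$\sum_{0\le x<m}\left\lceil\frac{x-s(x)}{m}\right\rceil = \sum_{0\le x<m}\Bigl(\!\Bigl(\frac{bx+c}{m}\Bigr)\!\Bigr) + \frac m2. \tag{12}$$

Let $b = b_0 d$, $m = m_0 d$, where $b_0$ and $m_0$ are relatively prime. We know that
$(b_0 x) \bmod m_0$ takes on the values $\{0, 1, \ldots, m_0 - 1\}$ in some order as x varies
from 0 to $m_0 - 1$. By (9) and (10) and the fact that

$$\Bigl(\!\Bigl(\frac{b(x+m_0)+c}{m}\Bigr)\!\Bigr) = \Bigl(\!\Bigl(\frac{bx+c}{m}\Bigr)\!\Bigr)$$

we have

$$\sum_{0\le x<m}\Bigl(\!\Bigl(\frac{bx+c}{m}\Bigr)\!\Bigr) = d\sum_{0\le x<m_0}\Bigl(\!\Bigl(\frac cm + \frac{b_0x}{m_0}\Bigr)\!\Bigr) = d\,\Bigl(\!\Bigl(\frac cd\Bigr)\!\Bigr). \tag{13}$$
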


Theorem P follows immediately from (12) and (13). 
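
The key step of this second proof, identity (13), can be checked exactly for small parameters. The sketch below assumes nothing beyond the definitions above; the function names are hypothetical, and d is computed as gcd(b, m).

```python
from fractions import Fraction
from math import floor, ceil, gcd

def sawtooth(x):
    """((x)) = x - (floor(x) + ceil(x))/2, as in Eq. (7)."""
    x = Fraction(x)
    return x - Fraction(floor(x) + ceil(x), 2)

def lhs_13(b, c, m):
    """Sum of (((b*x + c)/m)) over 0 <= x < m, the left side of (13)."""
    return sum(sawtooth(Fraction(b * x + c, m)) for x in range(m))

def rhs_13(b, c, m):
    """d * ((c/d)) with d = gcd(b, m), the right side of (13)."""
    d = gcd(b, m)
    return d * sawtooth(Fraction(c, d))

# Illustrative parameters: a = 13, so b = 12, with c = 5 and m = 64.
b, c, m = 12, 5, 64
print(lhs_13(b, c, m), rhs_13(b, c, m))
# By (12), adding m/2 gives the number of x with s(x) < x.
print(rhs_13(b, c, m) + Fraction(m, 2))
```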

One consequence of Theorem P is that practically any choice of a and c will 
give a reasonable probability that Xn+1 < Xn, at least over the entire period, 
except those that have large d. A large value of d corresponds to low potency, 
and we already know that generators of low potency are undesirable. 

The next theorem gives us a more stringent condition for the choice of the 
parameters a and c; we will consider the serial correlation test applied over the 
entire period. The quantity C defined in Section 3.3.2, Eq. (23), is 


$$C = \left(m\sum_{0\le x<m} x\,s(x) - \Bigl(\sum_{0\le x<m} x\Bigr)^{2}\right) \Bigg/ \left(m\sum_{0\le x<m} x^2 - \Bigl(\sum_{0\le x<m} x\Bigr)^{2}\right). \tag{14}$$

Let x′ be the element such that s(x′) = 0. We have

$$s(x) = m\,\Bigl(\!\Bigl(\frac{ax+c}{m}\Bigr)\!\Bigr) + \frac m2\,[x \ne x']. \tag{15}$$

The formulas we are about to derive can be expressed most easily in terms of
the sum

$$\sigma(h,k,c) = 12\sum_{0\le j<k}\Bigl(\!\Bigl(\frac jk\Bigr)\!\Bigr)\Bigl(\!\Bigl(\frac{hj+c}{k}\Bigr)\!\Bigr), \tag{16}$$

an important function that arises in several mathematical problems. It is called
a generalized Dedekind sum, since Richard Dedekind introduced the function
σ(h, k, 0) in 1876 when commenting on one of Riemann’s incomplete manuscripts.
[See B. Riemann’s Gesammelte math. Werke, 2nd ed. (1892), 466–478.]

Using the well-known formulas

$$\sum_{0\le x<m} x = \frac{m(m-1)}{2} \qquad\text{and}\qquad \sum_{0\le x<m} x^2 = \frac{m(m-\frac12)(m-1)}{3},$$

it is a straightforward matter to transform Eq. (14) into

$$C = \frac{m\,\sigma(a,m,c) - 3 + 6(m - x' - c)}{m^2 - 1}. \tag{17}$$

(See exercise 5.) Since m is usually very large, we may discard terms of order
1/m, and we have the approximation

$$C \approx \sigma(a,m,c)/m, \tag{18}$$

with an error of less than 6/m in absolute value.
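
For small moduli both sides of this transformation can be computed exactly, which gives a concrete feel for Eqs. (14), (16), and (17). The code is only a brute-force sketch with hypothetical function names; it evaluates σ straight from definition (16) in O(m) steps, so it is useful only for toy parameters.

```python
from fractions import Fraction
from math import floor, ceil

def sawtooth(x):
    """((x)) = x - (floor(x) + ceil(x))/2, as in Eq. (7)."""
    x = Fraction(x)
    return x - Fraction(floor(x) + ceil(x), 2)

def dedekind_sum(h, k, c):
    """Generalized Dedekind sum sigma(h, k, c), directly from definition (16)."""
    return 12 * sum(sawtooth(Fraction(j, k)) * sawtooth(Fraction(h * j + c, k))
                    for j in range(k))

def serial_correlation_direct(a, c, m):
    """Eq. (14), summed over the whole period 0 <= x < m."""
    sum_x = sum(range(m))
    sum_xs = sum(x * ((a * x + c) % m) for x in range(m))
    sum_xx = sum(x * x for x in range(m))
    return Fraction(m * sum_xs - sum_x ** 2, m * sum_xx - sum_x ** 2)

def serial_correlation_dedekind(a, c, m):
    """Eq. (17); x_prime is the element with s(x_prime) = 0."""
    x_prime = next(x for x in range(m) if (a * x + c) % m == 0)
    return (m * dedekind_sum(a, m, c) - 3 + 6 * (m - x_prime - c)) / (m * m - 1)

# Toy maximum-period parameters, chosen only for illustration.
for a, c, m in [(5, 1, 8), (13, 5, 64), (21, 7, 1000)]:
    print(a, c, m, serial_correlation_direct(a, c, m),
          serial_correlation_dedekind(a, c, m), float(dedekind_sum(a, m, c) / m))
```

The last printed column is the approximation (18), which can be compared with the exact value of C in the preceding columns.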

The serial correlation test now reduces to determining the value of the
Dedekind sum σ(a, m, c). Evaluating σ(a, m, c) directly from its definition (16)
is hardly any easier than evaluating the correlation coefficient itself directly, but
fortunately there are simple methods available for computing Dedekind sums
quite rapidly.

Lemma B (“Reciprocity law” for Dedekind sums). Let h, k, c be integers. If
$0 \le c < k$, $0 < h \le k$, and if h is relatively prime to k, then

$$\sigma(h,k,c) + \sigma(k,h,c) = \frac hk + \frac kh + \frac{1}{hk} + \frac{6c^2}{hk} - 6\left\lfloor\frac ch\right\rfloor - 3e(h,c), \tag{19}$$

where

$$e(h,c) = [c = 0] + [c \bmod h \ne 0]. \tag{20}$$
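
Before reading the proof, it may help to see the reciprocity law confirmed by brute force on small cases. This is a sketch under the stated hypotheses of Lemma B; the names `reciprocity_rhs` and `first_counterexample` are hypothetical, and the Dedekind sums are evaluated directly from (16).

```python
from fractions import Fraction
from math import floor, ceil, gcd

def sawtooth(x):
    """((x)) = x - (floor(x) + ceil(x))/2, as in Eq. (7)."""
    x = Fraction(x)
    return x - Fraction(floor(x) + ceil(x), 2)

def dedekind_sum(h, k, c):
    """sigma(h, k, c) from definition (16)."""
    return 12 * sum(sawtooth(Fraction(j, k)) * sawtooth(Fraction(h * j + c, k))
                    for j in range(k))

def reciprocity_rhs(h, k, c):
    """Right-hand side of (19), with e(h, c) as in (20)."""
    e = (1 if c == 0 else 0) + (1 if c % h != 0 else 0)
    return (Fraction(h, k) + Fraction(k, h) + Fraction(1 + 6 * c * c, h * k)
            - 6 * (c // h) - 3 * e)

def first_counterexample(limit):
    """Search 0 < h <= k <= limit, gcd(h, k) = 1, 0 <= c < k; None if (19) always holds."""
    for k in range(1, limit + 1):
        for h in range(1, k + 1):
            if gcd(h, k) != 1:
                continue
            for c in range(k):
                lhs = dedekind_sum(h, k, c) + dedekind_sum(k, h, c)
                if lhs != reciprocity_rhs(h, k, c):
                    return (h, k, c)
    return None

print(first_counterexample(10))
```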


Proof. We leave it to the reader to prove that, under these hypotheses,

$$\sigma(h,k,c) + \sigma(k,h,c) = \sigma(h,k,0) + \sigma(k,h,0) + \frac{6c^2}{hk} - 6\left\lfloor\frac ch\right\rfloor - 3e(h,c) + 3. \tag{21}$$

(See exercise 6.) The lemma now must be proved only in the case c = 0.

The proof we will give, based on complex roots of unity, is essentially due
to L. Carlitz. There is actually a simpler proof that uses only elementary
manipulations of sums (see exercise 7), but the following method reveals more
of the mathematical tools that are available for problems of this kind and it is
therefore much more instructive.

Let f(x) and g(x) be polynomials defined as follows:

$$f(x) = 1 + x + \cdots + x^{k-1} = (x^k - 1)/(x - 1),$$

$$g(x) = x + 2x^2 + \cdots + (k-1)x^{k-1} = x f'(x) = kx^k/(x-1) - x(x^k-1)/(x-1)^2. \tag{22}$$

If ω is the complex kth root of unity $e^{2\pi i/k}$, we have by Eq. 1.2.9–(13)

$$\frac1k\sum_{0\le j<k} \omega^{-jr} g(\omega^j x) = r x^r, \qquad \text{if } 0 \le r < k. \tag{23}$$

Set x = 1; then $g(\omega^j x) = k/(\omega^j - 1)$ if $j \ne 0$, otherwise it equals k(k − 1)/2.
Therefore

$$r \bmod k = \sum_{0<j<k} \frac{\omega^{-jr}}{\omega^j - 1} + \frac12(k-1), \qquad \text{if } r \text{ is an integer.}$$

(Eq. (23) shows that the right-hand side equals r when $0 \le r < k$, and it is
unchanged when multiples of k are added to r.) Hence

$$\Bigl(\!\Bigl(\frac rk\Bigr)\!\Bigr) = \frac1k\sum_{0<j<k} \frac{\omega^{-jr}}{\omega^j - 1} - \frac{1}{2k} + \frac12\,\delta\Bigl(\frac rk\Bigr). \tag{24}$$


This important formula, which holds whenever r is an integer, allows us to reduce
many calculations involving ((r/k)) to sums involving kth roots of unity, and it
brings a whole new range of techniques into the picture. In particular, we get
the following formula when h ⊥ k:

$$\sigma(h,k,0) + \frac{3(k-1)}{k^2} = \frac{12}{k^2}\sum_{0<r<k}\;\sum_{0<i<k}\;\sum_{0<j<k} \frac{\omega^{-ir}}{\omega^i - 1}\,\frac{\omega^{-jhr}}{\omega^j - 1}. \tag{25}$$

The right-hand side of this formula may be simplified by carrying out the sum
on r; we have $\sum_{0\le r<k} \omega^{rs} = f(\omega^s) = 0$ if $s \bmod k \ne 0$. Equation (25) now
reduces to

$$\sigma(h,k,0) + \frac{3(k-1)}{k} = \frac{12}{k}\sum_{0<j<k} \frac{1}{(\omega^{-jh} - 1)(\omega^j - 1)}. \tag{26}$$

A similar formula is obtained for σ(k, h, 0), with $\zeta = e^{2\pi i/h}$ replacing ω.
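
Formulas (24) and (26) lend themselves to a numerical spot check in floating-point complex arithmetic. The sketch below compares them with the exact rational values; the function names are assumptions for this illustration, and the agreement is only up to rounding error.

```python
import cmath
from fractions import Fraction
from math import floor, ceil, gcd

def sawtooth(x):
    """((x)) = x - (floor(x) + ceil(x))/2, as in Eq. (7)."""
    x = Fraction(x)
    return x - Fraction(floor(x) + ceil(x), 2)

def dedekind_sum(h, k, c):
    """sigma(h, k, c) from definition (16)."""
    return 12 * sum(sawtooth(Fraction(j, k)) * sawtooth(Fraction(h * j + c, k))
                    for j in range(k))

def sawtooth_via_roots(r, k):
    """Right-hand side of (24) for an integer r, in floating point."""
    w = cmath.exp(2j * cmath.pi / k)
    total = sum(w ** (-j * r) / (w ** j - 1) for j in range(1, k))
    return total / k - 1 / (2 * k) + (0.5 if r % k == 0 else 0.0)

def sigma0_via_roots(h, k):
    """sigma(h, k, 0) obtained from (26); requires gcd(h, k) = 1."""
    w = cmath.exp(2j * cmath.pi / k)
    total = sum(1 / ((w ** (-j * h) - 1) * (w ** j - 1)) for j in range(1, k))
    return 12 * total / k - 3 * (k - 1) / k

print(sawtooth_via_roots(5, 7), float(sawtooth(Fraction(5, 7))))
print(sigma0_via_roots(3, 7), float(dedekind_sum(3, 7, 0)))
```

The imaginary parts of the computed values should be negligible, since the sums are real in exact arithmetic.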

It is not obvious what we can do with the sum in (26), but there is an elegant
way to proceed, based on the fact that each term of the sum is a function of $\omega^j$,
where 0 < j < k; hence the sum is essentially taken over the kth roots of unity
other than 1. Whenever $x_1, x_2, \ldots, x_n$ are distinct complex numbers, we have
the identity

$$\sum_{j=1}^{n} \frac{1}{(x_j - x_1)\ldots(x_j - x_{j-1})(x - x_j)(x_j - x_{j+1})\ldots(x_j - x_n)} = \frac{1}{(x - x_1)\ldots(x - x_n)}, \tag{27}$$

which follows from the usual method of expanding the right-hand side into partial
fractions. Moreover, if $q(x) = (x - y_1)(x - y_2)\ldots(x - y_m)$, we have

$$q'(y_j) = (y_j - y_1)\ldots(y_j - y_{j-1})(y_j - y_{j+1})\ldots(y_j - y_m); \tag{28}$$

this identity may often be used to simplify expressions like those in the left-hand
side of (27). When h and k are relatively prime, the numbers $\omega, \omega^2, \ldots,
\omega^{k-1}, \zeta, \zeta^2, \ldots, \zeta^{h-1}$ are all distinct; we can therefore consider formula (27) in
the special case of the polynomial $(x - \omega)\ldots(x - \omega^{k-1})(x - \zeta)\ldots(x - \zeta^{h-1}) =
(x^k - 1)(x^h - 1)/(x - 1)^2$, obtaining the following identity in x:

$$\frac1h\sum_{0<j<h} \frac{\zeta^j(\zeta^j - 1)^2}{(\zeta^{jk} - 1)(x - \zeta^j)} + \frac1k\sum_{0<j<k} \frac{\omega^j(\omega^j - 1)^2}{(\omega^{jh} - 1)(x - \omega^j)} = \frac{(x-1)^2}{(x^h - 1)(x^k - 1)}. \tag{29}$$


This identity has many interesting consequences, and it leads to numerous
reciprocity formulas for sums of the type given in Eq. (26). For example, if we
differentiate (29) twice with respect to x and let $x \to 1$, we find that

$$\frac2h\sum_{0<j<h} \frac{\zeta^j(\zeta^j - 1)^2}{(\zeta^{jk} - 1)(1 - \zeta^j)^3} + \frac2k\sum_{0<j<k} \frac{\omega^j(\omega^j - 1)^2}{(\omega^{jh} - 1)(1 - \omega^j)^3} = \frac16\left(\frac hk + \frac kh + \frac{1}{hk}\right) + \frac12 - \frac{1}{2h} - \frac{1}{2k}.$$

Replace j by h − j and by k − j in these sums and use (26) to get

$$\frac16\left(\sigma(k,h,0) + \frac{3(h-1)}{h}\right) + \frac16\left(\sigma(h,k,0) + \frac{3(k-1)}{k}\right) = \frac16\left(\frac hk + \frac kh + \frac{1}{hk}\right) + \frac12 - \frac{1}{2h} - \frac{1}{2k},$$

which is equivalent to the desired result. ∎

Lemma B gives us an explicit function f(h, k, c) such that

$$\sigma(h,k,c) = f(h,k,c) - \sigma(k,h,c) \tag{30}$$

whenever $0 < h \le k$, $0 \le c < k$, and h is relatively prime to k. From the
definition (16) it is clear that

$$\sigma(k,h,c) = \sigma(k \bmod h,\; h,\; c \bmod h). \tag{31}$$

Therefore we can use (30) iteratively to evaluate σ(h, k, c), using a process that
reduces the parameters as in Euclid’s algorithm.


Further simplifications occur when we examine this iterative procedure more
closely. Let us set $m_1 = k$, $m_2 = h$, $c_1 = c$, and form the following tableau:

$$\begin{array}{ll}
m_1 = a_1 m_2 + m_3 & \qquad c_1 = b_1 m_2 + c_2\\
m_2 = a_2 m_3 + m_4 & \qquad c_2 = b_2 m_3 + c_3\\
m_3 = a_3 m_4 + m_5 & \qquad c_3 = b_3 m_4 + c_4\\
m_4 = a_4 m_5       & \qquad c_4 = b_4 m_5 + c_5
\end{array} \tag{32}$$

Here

$$a_j = \lfloor m_j/m_{j+1}\rfloor, \quad b_j = \lfloor c_j/m_{j+1}\rfloor, \quad m_{j+2} = m_j \bmod m_{j+1}, \quad c_{j+1} = c_j \bmod m_{j+1}; \tag{33}$$

and it follows that

$$0 \le m_{j+1} < m_j, \qquad 0 \le c_j < m_j. \tag{34}$$

We have assumed for convenience that Euclid’s algorithm terminates in (32)
after four iterations; this assumption will reveal the pattern that holds in the
general case. Since h and k were relatively prime to start with, we must have
$m_5 = 1$ and $c_5 = 0$ in (32).

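Let us assume also that $c_3 \ne 0$ but $c_4 = 0$, in order to get a feeling for the
effect this has on the recurrence. Equations (30) and (31) yield

$$\begin{aligned}
\sigma(h,k,c) &= \sigma(m_2, m_1, c_1)\\
&= f(m_2, m_1, c_1) - \sigma(m_3, m_2, c_2)\\
&= \cdots\\
&= f(m_2, m_1, c_1) - f(m_3, m_2, c_2) + f(m_4, m_3, c_3) - f(m_5, m_4, c_4).
\end{aligned}$$

The first part, h/k + k/h, of the formula for f(h, k, c) in (19) contributes

$$\frac{m_2}{m_1} + \frac{m_1}{m_2} - \frac{m_3}{m_2} - \frac{m_2}{m_3} + \frac{m_4}{m_3} + \frac{m_3}{m_4} - \frac{m_5}{m_4} - \frac{m_4}{m_5}$$

to the total, and this simplifies to

$$\frac hk + \frac{m_1 - m_3}{m_2} - \frac{m_2 - m_4}{m_3} + \frac{m_3 - m_5}{m_4} - \frac{m_4}{m_5} = \frac hk + a_1 - a_2 + a_3 - a_4.$$
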
The next part of (19), 1/hk, also leads to a simple contribution; according to
Eq. 4.5.3–(9) and other formulas in Section 4.5.3, we have

$$\frac{1}{m_1 m_2} - \frac{1}{m_2 m_3} + \frac{1}{m_3 m_4} - \frac{1}{m_4 m_5} = \frac{h'}{k} - 1, \tag{35}$$

where h′ is the unique integer satisfying

$$h'h \equiv 1 \pmod{k}, \qquad 0 < h' \le k. \tag{36}$$

Adding up all the contributions, and remembering our assumption that $c_4 = 0$
(so that $e(m_4, c_3) = 0$, see (20)), we find that

$$\sigma(h,k,c) = \frac{h + h'}{k} + (a_1 - a_2 + a_3 - a_4) - 6(b_1 - b_2 + b_3 - b_4)
 + 6\left(\frac{c_1^2}{m_1 m_2} - \frac{c_2^2}{m_2 m_3} + \frac{c_3^2}{m_3 m_4} - \frac{c_4^2}{m_4 m_5}\right) + 2,$$

in terms of the assumed tableau (32). Similar results hold in general:


Theorem D. Let h, k, c be integers with $0 < h \le k$, $0 \le c < k$, and h relatively
prime to k. Form the “Euclidean tableau” as defined in (33) above, and assume
that the process stops after t steps with $m_{t+1} = 1$. Let s be the smallest subscript
such that $c_s = 0$, and let h′ be defined by (36). Then

$$\sigma(h,k,c) = \frac{h + h'}{k} + \sum_{1\le j\le t} (-1)^{j+1}\left(a_j - 6b_j + 6\,\frac{c_j^2}{m_j m_{j+1}}\right) + 3\bigl((-1)^s + \delta_{s1}\bigr) - 2 + (-1)^t. \;∎$$

Euclid’s algorithm is analyzed carefully in Section 4.5.3; the quantities $a_1$,
$a_2$, ..., $a_t$ are called the partial quotients of h/k. Theorem 4.5.3F tells us that
the number of iterations, t, will never exceed $\log_\phi k$; hence Dedekind sums can
be evaluated rapidly. The terms $c_j^2/m_j m_{j+1}$ can be simplified further, and
an efficient algorithm for evaluating σ(h, k, c) appears in exercise 17.
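
The formula of Theorem D translates almost line by line into a short routine. What follows is a minimal sketch in exact rational arithmetic, not the integer-only algorithm requested in exercise 17; the function names are hypothetical, and the brute-force evaluator from definition (16) is included only so that the two can be compared on small inputs.

```python
from fractions import Fraction
from math import floor, ceil, gcd

def sawtooth(x):
    """((x)) = x - (floor(x) + ceil(x))/2, as in Eq. (7)."""
    x = Fraction(x)
    return x - Fraction(floor(x) + ceil(x), 2)

def dedekind_brute(h, k, c):
    """sigma(h, k, c) from definition (16); O(k) steps."""
    return 12 * sum(sawtooth(Fraction(j, k)) * sawtooth(Fraction(h * j + c, k))
                    for j in range(k))

def dedekind_euclid(h, k, c):
    """sigma(h, k, c) via the Euclidean tableau (33) and Theorem D."""
    if not (0 < h <= k and 0 <= c < k and gcd(h, k) == 1):
        raise ValueError("need 0 < h <= k, 0 <= c < k, gcd(h, k) = 1")
    h_inv = pow(h, -1, k) or k          # h' of (36), forced into 0 < h' <= k
    total = Fraction(h + h_inv, k)
    m_j, m_next, c_j = k, h, c          # m_1, m_2, c_1
    s = 1 if c == 0 else None           # smallest subscript with c_s = 0
    j = 1
    while True:
        a_j, m_after = divmod(m_j, m_next)
        b_j, c_next = divmod(c_j, m_next)
        sign = 1 if j % 2 == 1 else -1  # (-1)^(j+1)
        total += sign * (a_j - 6 * b_j + Fraction(6 * c_j * c_j, m_j * m_next))
        if s is None and c_next == 0:
            s = j + 1
        if m_next == 1:                 # m_{t+1} = 1, so t = j
            t = j
            break
        m_j, m_next, c_j = m_next, m_after, c_next
        j += 1
    return total + 3 * ((-1) ** s + (1 if s == 1 else 0)) - 2 + (-1) ** t

print(dedekind_euclid(3, 5, 2), dedekind_brute(3, 5, 2))
print(float(dedekind_euclid(10001, 10**10, 2113248653)))
```

The second call uses the parameters of Example 2 below, where a direct evaluation of (16) would need 10^10 terms while the tableau has only three rows.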


Now that we have analyzed generalized Dedekind sums, let us apply our 
knowledge to the determination of serial correlation coefficients. 


Example 1. Find the serial correlation when $m = 2^{35}$, $a = 2^{34} + 1$, c = 1.

Solution. We have

$$C = \bigl(2^{35}\,\sigma(2^{34}+1,\, 2^{35},\, 1) - 3 + 6(2^{35} - (2^{34} - 1) - 1)\bigr)/(2^{70} - 1),$$

by Eq. (17). To evaluate $\sigma(2^{34}+1, 2^{35}, 1)$, we can form the tableau

$$\begin{array}{llll}
m_1 = 2^{35} & & c_1 = 1 & \\
m_2 = 2^{34}+1 & a_1 = 1 & c_2 = 1 & b_1 = 0\\
m_3 = 2^{34}-1 & a_2 = 1 & c_3 = 1 & b_2 = 0\\
m_4 = 2 & a_3 = 2^{33}-1 & c_4 = 1 & b_3 = 0\\
m_5 = 1 & a_4 = 2 & c_5 = 0 & b_4 = 1
\end{array}$$

Since $h' = 2^{34} + 1$, the value according to Theorem D comes to $2^{33} - 3 + 2^{-32}$.
Thus

$$C = (2^{68} + 5)/(2^{70} - 1) = \tfrac14 + \epsilon, \qquad |\epsilon| < 2^{-67}. \tag{37}$$

Such a correlation is much, much too high for randomness. Of course, this
generator has very low potency, and we have already rejected it as nonrandom.

Example 2. Find the approximate serial correlation when $m = 10^{10}$, a = 10001,
c = 2113248653.

Solution. We have $C \approx \sigma(a,m,c)/m$, and the computation proceeds as follows:

$$\begin{array}{llll}
m_1 = 10000000000 & & c_1 = 2113248653 & \\
m_2 = 10001 & a_1 = 999900 & c_2 = 7350 & b_1 = 211303\\
m_3 = 100 & a_2 = 100 & c_3 = 50 & b_2 = 73\\
m_4 = 1 & a_3 = 100 & c_4 = 0 & b_3 = 50
\end{array}$$

$$\sigma(m_2, m_1, c_1) = -31.6926653544, \qquad C \approx -3\cdot 10^{-9}. \tag{38}$$

This is a very respectable value of C indeed. But the generator has a potency
of only 3, so it is not really a very good source of random numbers in spite of
the fact that it has low serial correlation. It is necessary to have a low serial
correlation, but not sufficient.


Example 3. Estimate the serial correlation for general a, m, and c.

Solution. If we consider just one application of (30), we have

$$\sigma(a,m,c) \approx \frac ma + 6\,\frac{c^2}{am} - 6\,\frac ca - \sigma(m,a,c).$$

Now $|\sigma(m,a,c)| < a$ by exercise 12, and therefore

$$C \approx \frac{\sigma(a,m,c)}{m} \approx \frac1a\left(1 - 6\,\frac cm + 6\Bigl(\frac cm\Bigr)^{2}\right). \tag{39}$$

The error in this approximation is less than (a + 6)/m in absolute value.

The estimate in (39) was the first theoretical result known about the random- 
ness of congruential generators. R. R. Coveyou [JACM 7 (1960), 72-74] obtained 
it by averaging over all real numbers x between 0 and m instead of considering 
only the integer values (see exercise 21); then Martin Greenberger [Math. Comp. 
15 (1961), 383-389] gave a rigorous derivation including an estimate of the 
error term. 

So began one of the saddest chapters in the history of computer science! 
Although the approximation above is quite correct, it has been grievously mis- 
applied in practice; people abandoned the perfectly good generators they had 
been using and replaced them by terrible generators that looked good from the 
standpoint of (39). For more than a decade, the most common random number 
generators in daily use were seriously deficient, solely because of a theoretical 
advance. 


A little Learning is a dang’rous Thing. 
— ALEXANDER POPE, An Essay on Criticism, 215 (1711) 


If we are to learn by past mistakes, we had better look carefully at how (39)
has been misused. In the first place people assumed uncritically that a small
serial correlation over the whole period would be a pretty good guarantee of
randomness; but in fact it doesn’t even ensure a small serial correlation for 1000
consecutive elements of the sequence (see exercise 14).

Secondly, (39) and its error term will ensure a relatively small value of C only
when $a \approx \sqrt m$; therefore people suggested choosing multipliers near $\sqrt m$. In fact,
we shall see that nearly all multipliers give a value of C that is substantially less
than $1/\sqrt m$, hence (39) is not a very good approximation to the true behavior.
Minimizing a crude upper bound for C does not minimize C.

In the third place, people observed that (39) yields its best estimate when

$$c/m \approx \tfrac12 \pm \tfrac16\sqrt3, \tag{40}$$

since these values are the roots of $1 - 6x + 6x^2 = 0$. “In the absence of any other
criterion for choosing c, we might as well use this one.” The latter statement
is not incorrect, but it is misleading at best, since experience has shown that
the value of c has hardly any influence on the true value of the serial correlation
when a is a good multiplier; the choice (40) reduces C substantially only in cases
like Example 2 above. And we are fooling ourselves in such cases, since the bad
multiplier will reveal its deficiencies in other ways.

Clearly we need a better estimate than (39); and such an estimate is now
available thanks to Theorem D, which stems principally from the work of Ulrich
Dieter [Math. Comp. 25 (1971), 855–883]. Theorem D implies that σ(a, m, c)
will be small if the partial quotients of a/m are small. Indeed, by analyzing
generalized Dedekind sums still more closely, it is possible to obtain quite a
sharp estimate:

Theorem K. Under the assumptions of Theorem D, we always have

$$-\frac12\sum_{\substack{1\le j\le t\\ j\ \mathrm{odd}}} a_j \;-\; \sum_{\substack{1\le j\le t\\ j\ \mathrm{even}}} a_j \;\le\; \sigma(h,k,c) \;\le\; \sum_{\substack{1\le j\le t\\ j\ \mathrm{odd}}} a_j \;+\; \frac12\sum_{\substack{1\le j\le t\\ j\ \mathrm{even}}} a_j \;-\; \frac12.$$

Proof. See D. E. Knuth, Acta Arithmetica 33 (1977), 297–325, where it is
shown further that these bounds are essentially the best possible when large
partial quotients are present. ∎


Example 4. Estimate the serial correlation for a = 3141592621, $m = 2^{35}$,
c odd.

Solution. The partial quotients of a/m are 10, 1, 14, 1, 7, 1, 1, 1, 3, 3, 3, 5, 2,
1, 8, 7, 1, 4, 1, 2, 4, 2; hence by Theorem K

$$-55 \le \sigma(a,m,c) \le 67.5,$$

and the serial correlation is guaranteed to be extremely low for all c.
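
The numbers in Example 4 can be reproduced with a few lines of integer arithmetic. The sketch below computes the partial quotients from the Euclidean tableau (with $m_1 = k$ and $m_2 = h$) and then forms the two bounds of Theorem K as stated above; the function names are assumptions made for this illustration.

```python
from fractions import Fraction

def partial_quotients(h, k):
    """Return a_1, a_2, ..., a_t of the tableau (33) with m_1 = k, m_2 = h."""
    quotients = []
    m_j, m_next = k, h
    while m_next != 0:
        q, r = divmod(m_j, m_next)
        quotients.append(q)
        m_j, m_next = m_next, r
    return quotients

def theorem_k_bounds(h, k):
    """Lower and upper bounds on sigma(h, k, c) in the form given by Theorem K."""
    a = partial_quotients(h, k)
    odd = sum(a[0::2])     # a_1 + a_3 + ...
    even = sum(a[1::2])    # a_2 + a_4 + ...
    lower = -Fraction(odd, 2) - even
    upper = odd + Fraction(even, 2) - Fraction(1, 2)
    return lower, upper

a, m = 3141592621, 2 ** 35
print(partial_quotients(a, m))
lower, upper = theorem_k_bounds(a, m)
print(float(lower), float(upper), float(upper) / m)
```

Dividing the bounds by m, as in (18), shows how small the guaranteed serial correlation is for this multiplier.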

Note that this bound is considerably better than we could obtain from (39),
since the error in (39) is of order a/m; our “random” multiplier has turned out
to be much better than one specifically chosen to look good on the basis of (39).
In fact, it is possible to show that the average value of $\sum_{j=1}^{t} a_j$, taken over all
multipliers a relatively prime to m, is

$$\frac{6}{\pi^2}(\ln m)^2 + O\bigl((\log m)(\log\log m)^4\bigr)$$

(see exercise 4.5.3–35). Therefore the probability that a random multiplier has
large $\sum_{j=1}^{t} a_j$, say larger than $(\log m)^{2+\epsilon}$ for some fixed ε > 0, approaches
zero as $m \to \infty$. This substantiates the empirical evidence that almost all
linear congruential sequences have extremely low serial correlation over the entire
period.

The exercises below show that other a priori tests, such as the serial test over 
the entire period, can also be expressed in terms of a few generalized Dedekind 
sums. It follows from Theorem K that linear congruential sequences will pass 
those tests provided that certain specified fractions (depending on a and m but 
not on c) have small partial quotients. In particular, the result of exercise 19 
implies that the serial test on pairs will be passed satisfactorily if and only if 
a/m has no large partial quotients. 

The book Dedekind Sums by Hans Rademacher and Emil Grosswald (Math. 
Assoc. of America, Carus Monograph No. 16, 1972) discusses the history and 
properties of Dedekind sums and their generalizations. Further theoretical tests, 
including the serial test in higher dimensions, are discussed in Section 3.3.4. 


EXERCISES — First Set 
1. [M10] Express x mod y in terms of the sawtooth and δ functions.


2. [HM22] What is the Fourier series expansion (in terms of sines and cosines) of 
the function ((x))? 


3. [M23] (N. J. Fine.) Prove that $\bigl|\sum_{k=0}^{n-1} ((2^k x + \tfrac12))\bigr| < 1$ for all real numbers x.


> 4. [M19] If $m = 10^{10}$, what is the highest possible value of d (in the notation of
Theorem P), given that the potency of the generator is 10?


5. [M21] Carry out the derivation of Eq. (17). 


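6. [M27] Assume that hh′ + kk′ = 1.

a) Show, without using Lemma B, that

$$\sigma(h,k,c) = \sigma(h,k,0) + 12\sum_{0<j<c}\Bigl(\!\Bigl(\frac{h'j}{k}\Bigr)\!\Bigr) + 6\,\Bigl(\!\Bigl(\frac{h'c}{k}\Bigr)\!\Bigr)$$

for all integers $c \ge 0$.

b) Show that $\bigl(\!\bigl(\frac{h'j}{k}\bigr)\!\bigr) + \bigl(\!\bigl(\frac{k'j}{h}\bigr)\!\bigr) = \frac{j}{hk} - \frac12\,\delta\bigl(\frac jh\bigr)$ if 0 < j < k.

c) Under the assumptions of Lemma B, prove Eq. (21).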


> 7. [M24] Give a proof of the reciprocity law (19), when c = 0, by using the general 
reciprocity law of exercise 1.2.4—45. 


> 8. [M34] (L. Carlitz.) Let

$$\rho(p,q,r) = 12\sum_{0\le j<r}\Bigl(\!\Bigl(\frac{jp}{r}\Bigr)\!\Bigr)\Bigl(\!\Bigl(\frac{jq}{r}\Bigr)\!\Bigr).$$

By generalizing the method of proof used in Lemma B, prove the following beautiful
identity due to H. Rademacher: If each of p, q, r is relatively prime to the other two,

$$\rho(p,q,r) + \rho(q,r,p) + \rho(r,p,q) = \frac{p}{qr} + \frac{q}{rp} + \frac{r}{pq} - 3.$$

(The reciprocity law for Dedekind sums, with c = 0, is the special case r = 1.)


9. [M40] Is there a simple proof of Rademacher’s identity (exercise 8) along the lines
of the proof in exercise 7 of a special case?


10. [M20] Show that when 0 < h < k it is possible to express σ(k − h, k, c) and
σ(h, k, −c) easily in terms of σ(h, k, c).

11. [M30] The formulas given in the text show us how to evaluate σ(h, k, c) when h
and k are relatively prime and c is an integer. For the general case, prove that

a) σ(dh, dk, dc) = σ(h, k, c), for integer d > 0;

b) σ(h, k, c + θ) = σ(h, k, c) + 6((h′c/k)), for integer c, real 0 < θ < 1, h ⊥ k, and
hh′ ≡ 1 (modulo k).

12. [M24] Show that if h is relatively prime to k and c is an integer,
$|\sigma(h,k,c)| \le (k-1)(k-2)/k$.

13. [M24] Generalize Eq. (26) so that it gives an expression for σ(h, k, c).

14. [M20] The linear congruential generator that has $m = 2^{35}$, $a = 2^{18} + 1$, c = 1,
was given the serial correlation test on three batches of 1000 consecutive numbers, and
the result was a very high correlation, between 0.2 and 0.3, in each case. What is the
serial correlation of this generator, taken over all $2^{35}$ numbers of the period?

15. [M21] Generalize Lemma B so that it applies to all real values of c, $0 \le c < k$.

16. [M24] Given the Euclidean tableau defined in (33), let $p_0 = 1$, $p_1 = a_1$, and
$p_j = a_j p_{j-1} + p_{j-2}$ for $1 < j \le t$. Show that the complicated portion of the sum
in Theorem D can be rewritten as follows, making it possible to avoid noninteger
computations:

$$\sum_{1\le j\le t} (-1)^{j+1}\frac{c_j^2}{m_j m_{j+1}} = \frac{1}{m_1}\sum_{1\le j\le t} (-1)^{j+1}\, b_j (c_j + c_{j+1})\, p_{j-1}.$$

[Hint: Prove that $\sum_{1\le j\le r} (-1)^{j+1}/m_j m_{j+1} = (-1)^{r+1} p_{r-1}/m_1 m_{r+1}$ for $1 \le r \le t$.]


17. [M22] Design an algorithm that evaluates σ(h, k, c) for integers h, k, c satisfying
the hypotheses of Theorem D. Your algorithm should use only integer arithmetic (of
unlimited precision), and it should produce the answer in the form A + B/k where A
and B are integers. (See exercise 16.) If possible, use only a finite number of variables
for temporary storage, instead of maintaining arrays such as $a_1, a_2, \ldots, a_t$.


18. [M23] (U. Dieter.) Given positive integers $h$, $k$, $z$, let

$$S(h,k,c,z) = \sum_{0\le j<z} \Bigl(\!\Bigl(\frac{hj+c}{k}\Bigr)\!\Bigr).$$

Show that this sum can be evaluated in closed form, in terms of generalized Dedekind
sums and the sawtooth function. [Hint: When $z \le k$, the quantity $\lfloor j/k\rfloor - \lfloor (j-z)/k\rfloor$
equals 1 for $0 \le j < z$, and it equals 0 for $z \le j < k$, so we can introduce this factor
and sum over $0 \le j < k$.]
> 19. [M23] Show that the serial test can be analyzed over the full period, in terms of
generalized Dedekind sums, by finding a formula for the probability that $\alpha \le X_n < \beta$
and $\alpha' \le X_{n+1} < \beta'$ when $\alpha$, $\beta$, $\alpha'$, $\beta'$ are given integers with $0 \le \alpha < \beta \le m$ and
$0 \le \alpha' < \beta' \le m$. [Hint: Consider the quantity $\lfloor (x-\alpha)/m\rfloor - \lfloor (x-\beta)/m\rfloor$.]
20. [M29] (U. Dieter.) Extend Theorem P by obtaining a formula for the probability
that $X_n > X_{n+1} > X_{n+2}$, in terms of generalized Dedekind sums.


EXERCISES — Second Set 


In many cases, exact computations with integers are quite difficult to carry out, but 
we can attempt to study the probabilities that arise when we take the average over all 
real values of x instead of restricting the calculation to integer values. Although these 
results are only approximate, they shed some light on the subject. 

It is convenient to deal with numbers $U_n$ between zero and one; for linear congruential
sequences, $U_n = X_n/m$, and we have $U_{n+1} = \{aU_n + \theta\}$, where $\theta = c/m$ and
$\{x\}$ denotes $x \bmod 1$. For example, the formula for serial correlation now becomes

$$C = \left(\int_0^1 x\,\{ax+\theta\}\,dx - \Bigl(\int_0^1 x\,dx\Bigr)^{2}\right) \Bigg/ \left(\int_0^1 x^2\,dx - \Bigl(\int_0^1 x\,dx\Bigr)^{2}\right).$$


21. [HM23] (R. R. Coveyou.) What is the value of C in the formula just given? 


22. [M22] Let $a$ be an integer, and let $0 \le \theta < 1$. If $x$ is a random real number,
uniformly distributed between 0 and 1, and if $s(x) = \{ax + \theta\}$, what is the probability
that $s(x) < x$? (This is the “real number” analog of Theorem P.)


23. [M28] The previous exercise gives the probability that $U_{n+1} < U_n$. What is
the probability that $U_{n+2} < U_{n+1} < U_n$, assuming that $U_n$ is a random real number
between zero and one?




24. [M29] Under the assumptions of the preceding problem, except with $\theta = 0$, show
that $U_n > U_{n+1} > \cdots > U_{n+t-1}$ occurs with probability

$$\frac{1}{t!}\Bigl(1 + \frac{1}{a}\Bigr)\cdots\Bigl(1 + \frac{t-2}{a}\Bigr).$$

What is the average length of a descending run starting at $U_n$, assuming that $U_n$ is
selected at random between zero and one?


> 25. [M25] Let $\alpha$, $\beta$, $\alpha'$, $\beta'$ be real numbers with $0 \le \alpha < \beta \le 1$, $0 \le \alpha' < \beta' \le 1$.
Under the assumptions of exercise 22, what is the probability that $\alpha \le x < \beta$ and
$\alpha' \le s(x) < \beta'$? (This is the “real number” analog of exercise 19.)


26. [M21] Consider a “Fibonacci” generator, where $U_{n+1} = \{U_n + U_{n-1}\}$. Assuming
that $U_1$ and $U_2$ are independently chosen at random between 0 and 1, find the probability
that $U_1 < U_2 < U_3$, $U_1 < U_3 < U_2$, $U_2 < U_1 < U_3$, etc. [Hint: Divide the unit
square $\{(x,y) \mid 0 \le x, y < 1\}$ into six parts, depending on the relative order of $x$, $y$,
and $\{x + y\}$, and determine the area of each part.]


27. [M32] In the Fibonacci generator of the preceding exercise, let $U_0$ and $U_1$ be chosen
independently in the unit square except that $U_0 > U_1$. Determine the probability
that $U_1$ is the beginning of an upward run of length $k$, so that $U_0 > U_1 < \cdots < U_k >
U_{k+1}$. Compare this with the corresponding probabilities for a random sequence.


28. [M35] According to Eq. 3.2.1.3-(5), a linear congruential generator with potency 2
satisfies the condition $X_{n-1} - 2X_n + X_{n+1} \equiv (a-1)c$ (modulo $m$). Consider a generator
that abstracts this situation: Let $U_{n+1} = \{\alpha + 2U_n - U_{n-1}\}$. As in exercise 26, divide
the unit square into parts that show the relative order of $U_1$, $U_2$, and $U_3$ for each pair
$(U_1, U_2)$. Are there any values of $\alpha$ for which all six possible orders are achieved with
probability $\frac16$, assuming that $U_1$ and $U_2$ are chosen at random in the unit square?


3.3.4. The Spectral Test 


In this section we shall study an especially important way to check the quality of 
linear congruential random number generators. Not only do all good generators 
pass this test, all generators now known to be bad actually fail it. Thus it 
is by far the most powerful test known, and it deserves particular attention. 
Our discussion will also bring out some fundamental limitations on the degree 
of randomness that we can expect from linear congruential sequences and their 
generalizations. 

The spectral test embodies aspects of both the empirical and theoretical 
tests studied in previous sections: It is like the theoretical tests because it deals 
with properties of the full period of the sequence, and it is like the empirical 
tests because it requires a computer program to determine the results. 


A. Ideas underlying the test. The most important randomness criteria seem 
to rely on properties of the joint distribution of t consecutive elements of the 
sequence, and the spectral test deals directly with this distribution. If we have 
a sequence $\langle U_n\rangle$ of period $m$, the basic idea is to analyze the set of all $m$ points

$$\{\,(U_n, U_{n+1}, \ldots, U_{n+t-1}) \mid 0 \le n < m\,\} \tag{1}$$


in t-dimensional space. 

For simplicity we shall assume that we have a linear congruential sequence
$(X_0, a, c, m)$ of maximum period length $m$ (so that $c \ne 0$), or that $m$ is prime
and $c = 0$ and the period length is $m - 1$. In the latter case we shall add the
point $(0,0,\ldots,0)$ to the set (1), so that there are always $m$ points in all; this
extra point has a negligible effect when $m$ is large, and it makes the theory much
simpler. Under these assumptions, (1) can be rewritten as

$$\Bigl\{\,\tfrac{1}{m}\bigl(x, s(x), s(s(x)), \ldots, s^{[t-1]}(x)\bigr) \Bigm| 0 \le x < m\,\Bigr\}, \tag{2}$$

where

$$s(x) = (ax + c) \bmod m \tag{3}$$


is the successor of x. We are considering only the set of all such points in t 
dimensions, not the order in which those points are actually generated. But the 
order of generation is reflected in the dependence between components of the 
vectors; and the spectral test studies such dependence for various dimensions t 
by dealing with the totality of all points (2). 

For example, Fig. 8 shows a typical small case in 2 and 3 dimensions, for 
the generator with 


s(x) = (137x + 187) mod 256. (4) 
Of course a generator with period length 256 will hardly be random, but 256 is
small enough that we can draw the diagram and gain some understanding before 
we turn to the larger m’s that are of practical interest. 
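
The point set (2) for the small generator (4) is easy to enumerate directly. The following Python sketch builds the $t$-dimensional points with exact fractions; the names `successor` and `points` are illustrative choices, not notation from the text.

```python
from fractions import Fraction

A, C, M = 137, 187, 256  # the generator of Eq. (4)

def successor(x, a=A, c=C, m=M):
    """s(x) = (a*x + c) mod m, as in Eq. (3)."""
    return (a * x + c) % m

def points(t, a=A, c=C, m=M):
    """Return the set (2): all points (x, s(x), ..., s^[t-1](x)) / m for 0 <= x < m."""
    result = set()
    for x in range(m):
        coords = []
        y = x
        for _ in range(t):
            coords.append(Fraction(y, m))
            y = successor(y, a, c, m)
        result.add(tuple(coords))
    return result

if __name__ == "__main__":
    print(len(points(2)), len(points(3)))  # 256 points in either dimension
    print(sorted(points(2))[:3])           # a few of the two-dimensional points
```

Only the set of points is kept, not the order of generation, which matches the way the spectral test treats the sequence.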

Perhaps the most striking thing about the pattern of boxes in Fig. 8(a) is 
that we can cover them all by a fairly small number of parallel lines; indeed, 
there are many different families of parallel lines that will hit all the points. For 
example, a set of 20 nearly vertical lines will do the job, as will a set of 21 lines 
that tilt upward at roughly a 30° angle. We commonly observe similar patterns 
when driving past farmlands that have been planted in a systematic manner. 

If the same generator is considered in three dimensions, we obtain 256 points
in a cube, obtained by appending a “height” component $s(s(x))$ to each of the
256 points $(x, s(x))$ in the plane of Fig. 8(a), as shown in Fig. 8(b). Let’s imagine
that this 3-D crystal structure has been made into a physical model, a cube that 
we can turn in our hands; as we rotate it, we will notice various families of 
parallel planes that encompass all of the points. In the words of Wallace Givens, 
the random numbers stay “mainly in the planes.” 

At first glance we might think that such systematic behavior is so nonrandom 
as to make congruential generators quite worthless; but more careful reflection, 
remembering that m is quite large in practice, provides a better insight. The 
regular structure in Fig. 8 is essentially the “grain” we see when examining 
our random numbers under a high-power microscope. If we take truly random
numbers between 0 and 1, and round or truncate them to finite accuracy so
that each is an integer multiple of $1/\nu$ for some given number $\nu$, then the
$t$-dimensional points (1) we obtain will have an extremely regular character when
viewed through a microscope.

Let $1/\nu_2$ be the maximum distance between lines, taken over all families
of parallel straight lines that cover the points $\{(x/m, s(x)/m)\}$ in two dimensions.
We shall call $\nu_2$ the two-dimensional accuracy of the random number
generator, since the pairs of successive numbers have a fine structure that is
essentially good to one part in $\nu_2$. Similarly, let $1/\nu_3$ be the maximum distance
between planes, taken over all families of parallel planes that cover all points
$\{(x/m, s(x)/m, s(s(x))/m)\}$; we shall call $\nu_3$ the accuracy in three dimensions.
The $t$-dimensional accuracy $\nu_t$ is the reciprocal of the maximum distance between
hyperplanes, taken over all families of parallel $(t-1)$-dimensional hyperplanes
that cover all points $\{(x/m, s(x)/m, \ldots, s^{[t-1]}(x)/m)\}$.

The essential difference between periodic sequences and truly random sequences
that have been truncated to multiples of $1/\nu$ is that the accuracy of
truly random sequences is the same in all dimensions, while that of periodic
sequences decreases as $t$ increases. Indeed, since there are only $m$ points in the
$t$-dimensional cube when $m$ is the period length, we can’t achieve a $t$-dimensional
accuracy of more than about $m^{1/t}$.

When the independence of $t$ consecutive values is considered, computer-generated
random numbers will behave essentially as if we took truly random
numbers and truncated them to $\lg \nu_t$ bits, where $\nu_t$ decreases with increasing $t$.
In practice, such varying accuracy is usually all we need. We don’t insist that the
10-dimensional accuracy be $2^{32}$, in the sense that all $(2^{32})^{10}$ possible 10-tuples
$(U_n, U_{n+1}, \ldots, U_{n+9})$ should be equally likely on a 32-bit machine; for such large
values of $t$ we want only a few of the leading bits of $(U_n, U_{n+1}, \ldots, U_{n+t-1})$ to
behave as if they were independently random.

On the other hand when an application demands high resolution of the 
random number sequence, simple linear congruential sequences will necessarily 
be inadequate. A generator with longer period should be used instead, even 
though only a small fraction of the period will actually be generated. Squaring 
the period length will essentially square the accuracy in higher dimensions; that 
is, it will double the effective number of bits of precision. 
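
The effect of squaring the period can be seen with a little arithmetic. The sketch below tabulates the rough ceiling $\lg m^{1/t} = (\lg m)/t$ on the number of bits of $t$-dimensional accuracy; the function name is hypothetical, and the figure is only the "about $m^{1/t}$" estimate quoted above, not an exact bound.

```python
import math

def accuracy_bits_ceiling(m, t):
    """Rough ceiling lg(m**(1/t)) = lg(m)/t on the bits of t-dimensional accuracy."""
    return math.log2(m) / t

if __name__ == "__main__":
    for t in (2, 3, 4, 6, 10):
        # Period 2**32 versus the squared period 2**64.
        print(t, accuracy_bits_ceiling(2**32, t), accuracy_bits_ceiling(2**64, t))
```

In every dimension the second column is twice the first, which is the sense in which squaring the period doubles the effective precision.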

The spectral test is based on the values of $\nu_t$ for small $t$, say $2 \le t \le 6$.
Dimensions 2, 3, and 4 seem to be adequate to detect important deficiencies
in a sequence, but since we are considering the entire period it is wise to be
somewhat cautious and go up into another dimension or two; on the other hand
the values of $\nu_t$ for $t \ge 10$ seem to be of no practical significance whatever. (This
is fortunate, because it appears to be rather difficult to calculate the accuracy $\nu_t$
precisely when $t \ge 10$.)

There is a vague relation between the spectral test and the serial test; for 
example, a special case of the serial test, taken over the entire period as in exercise 
3.3.3-19, counts the number of boxes in each of 64 subsquares of Fig. 8(a). The 
main difference is that the spectral test rotates the dots so as to discover the 
least favorable orientation. We shall return to the serial test later in this section. 


It may appear at first that we should apply the spectral test only for one 
suitably high value of t; if a generator passes the test in three dimensions, it seems 
plausible that it should also pass the 2-D test, hence we might as well omit the 
latter. The fallacy in this reasoning occurs because we apply more stringent 
conditions in lower dimensions. A similar situation occurs with the serial test:
Consider a generator that (quite properly) has almost the same number of points
in each subcube of the unit cube, when the unit cube has been divided into 64
subcubes of size $\frac14 \times \frac14 \times \frac14$; this same generator might yield completely empty
subsquares of the unit square, when the unit square has been divided into 64
subsquares of size $\frac18 \times \frac18$. Since we increase our expectations in lower dimensions,
a separate test for each dimension is required.

It is not always true that $\nu_t \le m^{1/t}$, although this upper bound is valid when
the points form a rectangular grid. For example, it turns out that $\nu_2 = \sqrt{274} >
\sqrt{256}$ in Fig. 8, because a nearly hexagonal structure brings the $m$ points closer
together than would be possible in a strictly rectangular arrangement.

In order to develop an algorithm that computes $\nu_t$ efficiently, we must look
more deeply at the associated mathematical theory. Therefore a reader who is 
not mathematically inclined is advised to skip to part D of this section, where 
the spectral test is presented as a “plug-in” method accompanied by several 
examples. But the mathematics behind the spectral test requires only some 
elementary manipulations of vectors. 

Some authors have suggested using the minimum number $N_t$ of parallel
covering lines or hyperplanes as the criterion, instead of the maximum distance
$1/\nu_t$ between them. However, this number $N_t$ does not appear to be as important
as the concept of accuracy defined above, because it is biased by how nearly
the slope of the lines or hyperplanes matches the coordinate axes of the cube.
For example, the 20 nearly vertical lines that cover all the points of Fig. 8(a)
are actually $1/\sqrt{328}$ units apart, according to Eq. (14) below with $(u_1, u_2) =
(18, -2)$; this might falsely imply an accuracy of one part in $\sqrt{328}$, or perhaps
even an accuracy of one part in 20. The true accuracy of only one part in $\sqrt{274}$ is
realized only for the larger family of 21 lines with a slope of 7/15; another family
of 24 lines, with a slope of $-11/13$, also has a greater inter-line distance than
the 20-line family, since $1/\sqrt{290} > 1/\sqrt{328}$. The precise way in which families
of lines act at the boundaries of the unit hypercube does not seem to be an
especially “clean” or significant criterion. However, for those people who prefer
to count hyperplanes, it is possible to compute $N_t$ using a method quite similar
to the way in which we shall calculate $\nu_t$ (see exercise 16).
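
The three families of lines mentioned here can be checked numerically for the generator $s(x) = (137x + 187) \bmod 256$. In the sketch below the normal vector $(18, -2)$ is the one quoted in the text; the vectors $(7, -15)$ and $(11, 13)$ are our own reading of the slopes $7/15$ and $-11/13$ (a line of slope $p/q$ is perpendicular to $(p, -q)$), and the function names are illustrative.

```python
import math

A, C, M = 137, 187, 256  # generator (4)

def family_size(u1, u2, a=A, c=C, m=M):
    """Count the distinct lines u1*x1 + u2*x2 = const that contain points (x, s(x))."""
    return len({u1 * x + u2 * ((a * x + c) % m) for x in range(m)})

def spacing(u1, u2):
    """Distance between neighbouring lines of the family, 1/length(U)."""
    return 1 / math.sqrt(u1 * u1 + u2 * u2)

if __name__ == "__main__":
    for u1, u2 in [(18, -2), (7, -15), (11, 13)]:
        # Each normal vector satisfies u1 + a*u2 = 0 (modulo m).
        assert (u1 + A * u2) % M == 0
        print((u1, u2), family_size(u1, u2), u1 * u1 + u2 * u2)
```

The family with the fewest lines, 20, has the smallest spacing of the three, which illustrates why the line count is a less reliable figure of merit than the maximum distance.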


*B. Theory behind the test. In order to analyze the basic set (2), we start 
with the observation that 


$$\frac{1}{m}\, s^{[j]}(x) = \left(\frac{a^jx + (1 + a + \cdots + a^{j-1})\,c}{m}\right) \bmod 1. \tag{5}$$


We can get rid of the “mod 1” operation by extending the set periodically, making 
infinitely many copies of the original t-dimensional hypercube, proceeding in all 
directions. This gives us the set 


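$$\begin{aligned}
L &= \Bigl\{\Bigl(\frac{x}{m} + k_1,\ \frac{s(x)}{m} + k_2,\ \ldots,\ \frac{s^{[t-1]}(x)}{m} + k_t\Bigr) \Bigm| \text{integer } x, k_1, k_2, \ldots, k_t\Bigr\} \\
&= \Bigl\{V_0 + \Bigl(\frac{x}{m} + k_1,\ \frac{ax}{m} + k_2,\ \ldots,\ \frac{a^{t-1}x}{m} + k_t\Bigr) \Bigm| \text{integer } x, k_1, k_2, \ldots, k_t\Bigr\},
\end{aligned}$$

where

$$V_0 = \frac{1}{m}\bigl(0,\ c,\ (1+a)c,\ \ldots,\ (1 + a + \cdots + a^{t-2})c\bigr) \tag{6}$$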


is a constant vector. The variable $k_1$ is redundant in this representation of $L$,
because we can change $(x, k_1, k_2, \ldots, k_t)$ to $(x + k_1m, 0, k_2 - ak_1, \ldots, k_t - a^{t-1}k_1)$,
reducing $k_1$ to zero without loss of generality. Therefore we obtain the comparatively
simple formula

$$L = \{\,V_0 + y_1V_1 + y_2V_2 + \cdots + y_tV_t \mid \text{integer } y_1, y_2, \ldots, y_t\,\}, \tag{7}$$

where

$$V_1 = \frac{1}{m}(1, a, a^2, \ldots, a^{t-1}); \tag{8}$$

$$V_2 = (0,1,0,\ldots,0),\quad V_3 = (0,0,1,\ldots,0),\quad \ldots,\quad V_t = (0,0,0,\ldots,1). \tag{9}$$

The points $(x_1, x_2, \ldots, x_t)$ of $L$ that satisfy $0 \le x_j < 1$ for all $j$ are precisely the
$m$ points of our original set (2).
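
This last claim can be verified mechanically for a small case. The sketch below, which assumes the generator $s(x) = (137x + 187) \bmod 256$ and $t = 3$, compares the directly generated set (2) with the points of the lattice representation (7) that fall in the unit cube; the function names are illustrative.

```python
from fractions import Fraction

A, C, M, T = 137, 187, 256, 3

def direct_points():
    """The set (2): (x, s(x), s(s(x)))/m for 0 <= x < m."""
    pts = set()
    for x in range(M):
        y, coords = x, []
        for _ in range(T):
            coords.append(Fraction(y, M))
            y = (A * y + C) % M
        pts.add(tuple(coords))
    return pts

def lattice_points_in_unit_cube():
    """Points V0 + y1*V1 + y2*V2 + ... of (7) that lie in the unit cube [0,1)^t."""
    # V0 = (0, c, (1+a)c, ...)/m and V1 = (1, a, a^2, ...)/m, as in (6) and (8).
    v0 = [Fraction(C * sum(A**i for i in range(j)), M) for j in range(T)]
    v1 = [Fraction(A**j, M) for j in range(T)]
    pts = set()
    for y1 in range(M):
        # y1 in [0, m) keeps the first coordinate in [0, 1); adding integer multiples
        # of V2, ..., Vt amounts to reducing the remaining coordinates mod 1.
        pts.add(tuple((p + y1 * q) % 1 for p, q in zip(v0, v1)))
    return pts

if __name__ == "__main__":
    print(direct_points() == lattice_points_in_unit_cube())
```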

Notice that the increment $c$ appears only in $V_0$, and the effect of $V_0$ is
merely to shift all elements of $L$ without changing their relative distances; hence
$c$ does not affect the spectral test in any way, and we might as well assume that
$V_0 = (0,0,\ldots,0)$ when we are calculating $\nu_t$. When $V_0$ is the zero vector we
have a lattice of points

$$L_0 = \{\,y_1V_1 + y_2V_2 + \cdots + y_tV_t \mid \text{integer } y_1, y_2, \ldots, y_t\,\}, \tag{10}$$

and our goal is to study the distances between adjacent $(t-1)$-dimensional
hyperplanes, in families of parallel hyperplanes that cover all the points of $L_0$.

A family of parallel $(t-1)$-dimensional hyperplanes can be defined by a
nonzero vector $U = (u_1, \ldots, u_t)$ that is perpendicular to all of them; and the set
of points on a particular hyperplane is then

$$\{\,(x_1, \ldots, x_t) \mid x_1u_1 + \cdots + x_tu_t = q\,\}, \tag{11}$$
where q is a different constant for each hyperplane in the family. In other words, 
each hyperplane is the set of all vectors X for which the dot product X-U has a 
given value q. In our case the hyperplanes are all separated by a fixed distance, 
and one of them contains (0,0,...,0); hence we can adjust the magnitude of U 
so that the set of all integer values q gives all the hyperplanes in the family. 
Then the distance between neighboring hyperplanes is the minimum distance 
from (0,0,...,0) to the hyperplane for q = 1, namely 


$$\min_{\text{real } x_1, \ldots, x_t} \Bigl\{\sqrt{x_1^2 + \cdots + x_t^2} \Bigm| x_1u_1 + \cdots + x_tu_t = 1\Bigr\}. \tag{12}$$


Cauchy’s inequality (see exercise 1.2.3-30) tells us that

$$(x_1u_1 + \cdots + x_tu_t)^2 \le (x_1^2 + \cdots + x_t^2)(u_1^2 + \cdots + u_t^2), \tag{13}$$

hence the minimum in (12) occurs when each $x_j = u_j/(u_1^2 + \cdots + u_t^2)$; the distance
between neighboring hyperplanes is

$$1\big/\sqrt{u_1^2 + \cdots + u_t^2} = 1/\operatorname{length}(U). \tag{14}$$




In other words, the quantity $\nu_t$ that we seek is precisely the length of the shortest
vector $U$ that defines a family of hyperplanes $\{X \cdot U = q \mid \text{integer } q\}$ containing
all the elements of $L_0$.


Such a vector $U = (u_1, \ldots, u_t)$ must be nonzero, and it must satisfy $V \cdot U =$
integer for all $V$ in $L_0$. In particular, since the points $(1,0,\ldots,0)$, $(0,1,\ldots,0)$,
..., $(0,0,\ldots,1)$ are all in $L_0$, all of the $u_j$ must be integers. Furthermore since
$V_1$ is in $L_0$, we must have $\frac{1}{m}(u_1 + au_2 + \cdots + a^{t-1}u_t) =$ integer, i.e.,

$$u_1 + au_2 + \cdots + a^{t-1}u_t \equiv 0 \ (\text{modulo } m). \tag{15}$$


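Conversely, any nonzero integer vector $U = (u_1, \ldots, u_t)$ satisfying (15) defines a
family of hyperplanes with the required properties, since all of $L_0$ will be covered:
The dot product $(y_1V_1 + \cdots + y_tV_t) \cdot U$ will be an integer for all integers $y_1, \ldots, y_t$.
We have proved that

$$\begin{aligned}
\nu_t^2 &= \min_{(u_1,\ldots,u_t) \ne (0,\ldots,0)} \bigl\{\, u_1^2 + \cdots + u_t^2 \bigm| u_1 + au_2 + \cdots + a^{t-1}u_t \equiv 0 \ (\text{modulo } m) \,\bigr\} \\
&= \min_{(x_1,\ldots,x_t) \ne (0,\ldots,0)} \bigl( (mx_1 - ax_2 - a^2x_3 - \cdots - a^{t-1}x_t)^2 + x_2^2 + x_3^2 + \cdots + x_t^2 \bigr).
\end{aligned} \tag{16}$$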
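
For a very small modulus the minimum in (16) can be found by sheer enumeration, which gives a useful cross-check on the theory before any cleverness is introduced. The following sketch is such a brute-force search (the function name is our own); it is practical only for toy values of $m$ and small $t$.

```python
from itertools import product
from math import isqrt

def nu_squared_brute_force(a, m, t):
    """Minimise u1^2 + ... + ut^2 over nonzero integer vectors with
    u1 + a*u2 + ... + a^(t-1)*ut = 0 (modulo m), as in (16).  Small m only."""
    best = min(m * m, 1 + a * a)   # (m,0,...,0) and (-a,1,0,...,0) both qualify
    bound = isqrt(best)            # no coordinate of a minimising vector exceeds this
    powers = [pow(a, j, m) for j in range(1, t)]
    for tail in product(range(-bound, bound + 1), repeat=t - 1):
        if not any(tail):
            continue               # u2 = ... = ut = 0 forces u1 = +-m, already counted
        u1 = -sum(p * u for p, u in zip(powers, tail)) % m
        if u1 > m // 2:
            u1 -= m                # choose the residue of smallest absolute value
        best = min(best, u1 * u1 + sum(u * u for u in tail))
    return best

if __name__ == "__main__":
    print(nu_squared_brute_force(137, 256, 2))  # 274, i.e. nu_2 = sqrt(274)
```

The running time grows like $(2\sqrt{B}+1)^{t-1}$ for the initial bound $B$, so this approach is hopeless for moduli of practical size.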


C. Deriving a computational method. We have now reduced the spectral 
test to the problem of finding the minimum value (16); but how on earth can we 
determine that minimum value in a reasonable amount of time? A brute-force 
search is out of the question, since m is very large in cases of practical interest. 

It will be interesting and probably more useful if we develop a computational 
method for solving an even more general problem: Find the minimum value of 
the quantity 


$$f(x_1, \ldots, x_t) = (u_{11}x_1 + \cdots + u_{t1}x_t)^2 + \cdots + (u_{1t}x_1 + \cdots + u_{tt}x_t)^2 \tag{17}$$

over all nonzero integer vectors $(x_1, \ldots, x_t)$, given any nonsingular matrix of
coefficients $U = (u_{ij})$. The expression (17) is called a “positive definite quadratic
form” in $t$ variables. Since $U$ is nonsingular, (17) cannot be zero unless the $x_j$
are all zero.


Let us write $U_1, \ldots, U_t$ for the rows of $U$. Then (17) may be written

$$f(x_1, \ldots, x_t) = (x_1U_1 + \cdots + x_tU_t) \cdot (x_1U_1 + \cdots + x_tU_t), \tag{18}$$

the square of the length of the vector $x_1U_1 + \cdots + x_tU_t$. The nonsingular matrix
$U$ has an inverse, which means that we can find uniquely determined vectors
$V_1, \ldots, V_t$ such that

$$U_i \cdot V_j = \delta_{ij}, \qquad 1 \le i, j \le t. \tag{19}$$
For example, in the special form (16) that arises in the spectral test, we have

$$\begin{aligned}
U_1 &= (m, 0, 0, \ldots, 0), & V_1 &= \tfrac{1}{m}(1, a, a^2, \ldots, a^{t-1}),\\
U_2 &= (-a, 1, 0, \ldots, 0), & V_2 &= (0, 1, 0, \ldots, 0),\\
U_3 &= (-a^2, 0, 1, \ldots, 0), & V_3 &= (0, 0, 1, \ldots, 0),\\
&\ \ \vdots & &\ \ \vdots\\
U_t &= (-a^{t-1}, 0, 0, \ldots, 1), & V_t &= (0, 0, 0, \ldots, 1).
\end{aligned} \tag{20}$$


These $V_j$ are precisely the vectors (8), (9) that we used to define our original
lattice $L_0$. As the reader may well suspect, this is not a coincidence; indeed, if
we had begun with an arbitrary lattice $L_0$, defined by any set of linearly independent
vectors $V_1, \ldots, V_t$, the argument we have used above can be generalized
to show that the maximum separation between hyperplanes in a covering family
is equivalent to minimizing (17), where the coefficients $u_{ij}$ are defined by (19).
(See exercise 2.)
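
The duality between the rows of $U$ and the vectors $V_j$ is easy to confirm by machine. The sketch below builds the two families displayed in (20) for given $a$, $m$, $t$ and checks the orthogonality condition (19) with exact rational arithmetic; the function names are illustrative.

```python
from fractions import Fraction

def spectral_bases(a, m, t):
    """Rows U_1..U_t and V_1..V_t as displayed in (20)."""
    U = [[m] + [0] * (t - 1)]
    V = [[Fraction(a**j, m) for j in range(t)]]
    for i in range(1, t):
        unit = [1 if j == i else 0 for j in range(t)]
        U.append([-a**i] + unit[1:])
        V.append([Fraction(x) for x in unit])
    return U, V

def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

def is_dual(U, V):
    """True when U_i . V_j equals 1 for i = j and 0 otherwise, condition (19)."""
    t = len(U)
    return all(dot(U[i], V[j]) == (1 if i == j else 0)
               for i in range(t) for j in range(t))

if __name__ == "__main__":
    U, V = spectral_bases(137, 256, 4)
    print(is_dual(U, V))
```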

Our first step in minimizing (18) is to reduce it to a finite problem, namely
to show that we won’t need to test infinitely many vectors $(x_1, \ldots, x_t)$ when
finding the minimum. This is where the vectors $V_1, \ldots, V_t$ come in handy; we
have

$$x_k = (x_1U_1 + \cdots + x_tU_t) \cdot V_k,$$

and Cauchy’s inequality tells us that

$$\bigl((x_1U_1 + \cdots + x_tU_t) \cdot V_k\bigr)^2 \le f(x_1, \ldots, x_t)\,(V_k \cdot V_k).$$

Hence we have derived a useful upper bound on each coordinate $x_k$:

Lemma A. Let $(x_1, \ldots, x_t)$ be a nonzero vector that minimizes (18) and let
$(y_1, \ldots, y_t)$ be any nonzero integer vector. Then

$$x_k^2 \le f(y_1, \ldots, y_t)\,(V_k \cdot V_k), \qquad \text{for } 1 \le k \le t. \tag{21}$$

In particular, letting $y_i = \delta_{ij}$ for all $i$,

$$x_k^2 \le (U_j \cdot U_j)(V_k \cdot V_k), \qquad \text{for } 1 \le j, k \le t. \tag{22}$$


Lemma A reduces the problem to a finite search, but the right-hand side of 
(21) is usually much too large to make an exhaustive search feasible; we need at 
least one more idea. On such occasions, an old maxim provides sound advice: “If 
you can’t solve a problem as it is stated, change it into a simpler problem that 
has the same answer.” For example, Euclid’s algorithm has this form; if we don’t 
know the gcd of the input numbers, we change them into smaller numbers having 
the same gcd. (In fact, a slightly more general approach probably underlies the 
discovery of nearly all algorithms: “If you can’t solve a problem directly, change 
it into one or more simpler problems, from whose solution you can solve the 
original one.”) 

In our case, a simpler problem is one that requires less searching because the
right-hand side of (22) is smaller. The key idea we shall use is that it is possible
to change one quadratic form into another one that is equivalent for all practical
purposes. Let $j$ be any fixed subscript, $1 \le j \le t$; let $(q_1, \ldots, q_{j-1}, q_{j+1}, \ldots, q_t)$
be any sequence of $t - 1$ integers; and consider the following transformation of
the vectors:

$$\begin{aligned}
V_i' &= V_i - q_iV_j, & x_i' &= x_i - q_ix_j, & U_i' &= U_i, \qquad \text{for } i \ne j;\\
V_j' &= V_j, & x_j' &= x_j, & U_j' &= U_j + \textstyle\sum_{i\ne j} q_iU_i.
\end{aligned} \tag{23}$$

It is easy to see that the new vectors $U_1', \ldots, U_t'$ define a quadratic form $f'$
for which $f'(x_1', \ldots, x_t') = f(x_1, \ldots, x_t)$; furthermore the basic orthogonality
condition (19) remains valid, because it is easy to check that $U_i' \cdot V_j' = \delta_{ij}$. As
$(x_1, \ldots, x_t)$ runs through all nonzero integer vectors, so does $(x_1', \ldots, x_t')$; hence
the new form $f'$ has the same minimum as $f$.
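
Both properties of transformation (23) can be checked on a concrete case. The sketch below applies the transformation to integer bases for $a = 137$, $m = 256$, $t = 3$; as an assumption made purely to stay in integer arithmetic, the vectors $V_i$ have been multiplied by $m$, so that the orthogonality condition reads $U_i \cdot V_j = m\delta_{ij}$. The function names and the particular multipliers are illustrative.

```python
def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

def transform(U, V, j, q):
    """Apply (23) for row index j (0-based) with multipliers q[i]; q[j] is ignored."""
    t = len(U)
    new_V = [V[i] if i == j else [v - q[i] * w for v, w in zip(V[i], V[j])]
             for i in range(t)]
    new_Uj = list(U[j])
    for i in range(t):
        if i != j:
            new_Uj = [u + q[i] * w for u, w in zip(new_Uj, U[i])]
    new_U = [new_Uj if i == j else U[i] for i in range(t)]
    return new_U, new_V

def form(U, x):
    """f(x) = squared length of x_1 U_1 + ... + x_t U_t, as in (18)."""
    y = [sum(x[i] * U[i][k] for i in range(len(U))) for k in range(len(U[0]))]
    return dot(y, y)

M = 256
U = [[256, 0, 0], [-137, 1, 0], [-137**2, 0, 1]]
V = [[1, 137, 137**2], [0, 256, 0], [0, 0, 256]]   # m times the V's of (20)

j, q = 0, [0, 3, -2]
U2, V2 = transform(U, V, j, q)
x = [5, -1, 2]
x2 = [x[i] if i == j else x[i] - q[i] * x[j] for i in range(3)]  # x' of (23)

if __name__ == "__main__":
    print(form(U, x) == form(U2, x2))
    print(all(dot(U2[i], V2[k]) == (M if i == k else 0)
              for i in range(3) for k in range(3)))
```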

Our goal is to use transformation (23), replacing $U_i$ by $U_i'$ and $V_i$ by $V_i'$ for
all $i$, in order to make the right-hand side of (22) small; and the right-hand side
of (22) will be small when both $U_j \cdot U_j$ and $V_k \cdot V_k$ are small. Therefore it is
natural to ask the following two questions about the transformation (23):

a) What choice of $q_i$ makes $V_i' \cdot V_i'$ as small as possible?
b) What choice of $q_1, \ldots, q_{j-1}, q_{j+1}, \ldots, q_t$ makes $U_j' \cdot U_j'$ as small as possible?

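It is easiest to solve these questions first for real values of the $q_i$. Question (a)
is quite simple, since

$$\begin{aligned}
(V_i - q_iV_j) \cdot (V_i - q_iV_j) &= V_i \cdot V_i - 2q_i\,V_i \cdot V_j + q_i^2\,V_j \cdot V_j\\
&= (V_j \cdot V_j)\bigl(q_i - (V_i \cdot V_j / V_j \cdot V_j)\bigr)^2 + V_i \cdot V_i - (V_i \cdot V_j)^2 / V_j \cdot V_j,
\end{aligned}$$

and the minimum occurs when

$$q_i = V_i \cdot V_j / V_j \cdot V_j. \tag{24}$$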


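Geometrically, we are asking what multiple of $V_j$ should be subtracted from $V_i$
so that the resulting vector $V_i'$ has minimum length, and the answer is to choose
$q_i$ so that $V_i'$ is perpendicular to $V_j$ (that is, to make $V_i' \cdot V_j = 0$); the following
diagram makes this plain.

[Diagram (25): the vector $V_i$, the multiple $-q_iV_j$ added to it, and the resulting vector $V_i'$ perpendicular to $V_j$.]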


Turning to question (b), we want to choose the $q_i$ so that $U_j + \sum_{i\ne j} q_iU_i$ has
minimum length; geometrically, we want to start with $U_j$ and add some vector
in the $(t-1)$-dimensional hyperplane whose points are the sums of multiples
of $\{U_i \mid i \ne j\}$. Again the best solution is to choose things so that $U_j'$ is
perpendicular to the hyperplane, making $U_j' \cdot U_k = 0$ for all $k \ne j$:

$$U_j \cdot U_k + \sum_{i \ne j} q_i\,(U_i \cdot U_k) = 0, \qquad 1 \le k \le t, \quad k \ne j. \tag{26}$$

(See exercise 12 for a rigorous proof that a solution to question (b) must satisfy
these $t - 1$ equations.)

Now that we have answered questions (a) and (b), we are in a bit of a
quandary; should we choose the $q_i$ according to (24), so that the $V_i' \cdot V_i'$ are
minimized, or according to (26), so that $U_j' \cdot U_j'$ is minimized? Either of these
alternatives makes an improvement in the right-hand side of (22), so it is not 
immediately clear which choice should get priority. Fortunately, there is a very 
simple answer to this dilemma: Conditions (24) and (26) are exactly the same! 
(See exercise 7.) Therefore questions (a) and (b) have the same answer; we have 
a happy state of affairs in which we can reduce the length of both the U’s and 
the V’s simultaneously. Indeed, we have just rediscovered the Gram-Schmidt 
orthogonalization process [see Crelle 94 (1883), 41-73]. 
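
The coincidence of (24) and (26) can be observed numerically. The sketch below computes the real-valued multipliers of (24) as exact fractions, forms the resulting $U_j'$, and lets one test whether it is perpendicular to every other $U_k$, which is what (26) demands. The example bases are those of the spectral test for $a = 137$, $m = 256$, $t = 3$, with the $V_i$ scaled by $m$ (an assumption that does not affect the ratios in (24)); the function name is our own.

```python
from fractions import Fraction

def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

def real_optimum(U, V, j):
    """Multipliers q_i = (V_i . V_j)/(V_j . V_j) of (24) and the resulting U_j'."""
    t = len(U)
    vjj = dot(V[j], V[j])
    q = [Fraction(dot(V[i], V[j]), vjj) for i in range(t)]
    new_uj = [Fraction(u) for u in U[j]]
    for i in range(t):
        if i != j:
            new_uj = [u + q[i] * w for u, w in zip(new_uj, U[i])]
    return q, new_uj

U = [[256, 0, 0], [-137, 1, 0], [-137**2, 0, 1]]
V = [[1, 137, 137**2], [0, 256, 0], [0, 0, 256]]

if __name__ == "__main__":
    for j in range(3):
        q, uj = real_optimum(U, V, j)
        # Each printed list should consist of zeros if (24) also solves (26).
        print(j, [dot(uj, U[k]) for k in range(3) if k != j])
```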

Our joy must be tempered with the realization that we have dealt with
questions (a) and (b) only for real values of the $q_i$. Our application restricts us
to integer values, so we cannot make $V_i'$ exactly perpendicular to $V_j$. The best
we can do for question (a) is to let $q_i$ be the nearest integer to $V_i \cdot V_j / V_j \cdot V_j$
(see (25)). It turns out that this is not always the best solution to question (b);
in fact $U_j'$ may at times be longer than $U_j$. However, the bound (21) is never
increased, since we can remember the smallest value of $f(y_1, \ldots, y_t)$ found so
far. Thus a choice of $q_i$ based solely on question (a) is quite satisfactory.

If we apply transformation (23) repeatedly in such a way that none of the
vectors $V_i$ gets longer and at least one gets shorter, we can never get into a
loop; that is, we will never be considering the same quadratic form again after
a sequence of nontrivial transformations of this kind. But eventually we will
get stuck, in the sense that none of the transformations (23) for $1 \le j \le t$
will be able to shorten any of the vectors $V_1, \ldots, V_t$. At that point we can
revert to an exhaustive search, using the bounds of Lemma A, which will now 
be quite small in most cases. Occasionally these bounds (21) will be poor, and 
another type of transformation will usually get the algorithm unstuck again and 
reduce the bounds (see exercise 18). However, transformation (23) by itself has 
proved to be quite adequate for the spectral test; in fact, it has proved to be 
amazingly powerful when the computations are arranged as in the algorithm 
discussed below. 


*D. How to perform the spectral test. Here now is an efficient computational 
procedure that follows from our considerations. R. W. Gosper and U. Dieter 
have observed that it is possible to use the results of lower dimensions to make 
the spectral test significantly faster in higher dimensions. This refinement has 
been incorporated into the following algorithm, together with Gauss’s significant 
simplification of the two-dimensional case (exercise 5). 


Algorithm S (The spectral test). This algorithm determines the value of

$$\nu_t = \min \Bigl\{ \sqrt{x_1^2 + \cdots + x_t^2} \Bigm| x_1 + ax_2 + \cdots + a^{t-1}x_t \equiv 0 \ (\text{modulo } m) \Bigr\} \tag{27}$$

for $2 \le t \le T$, given $a$, $m$, and $T$, where $0 < a < m$ and $a$ is relatively prime to
$m$. (The minimum is taken over all nonzero integer vectors $(x_1, \ldots, x_t)$, and the
number $\nu_t$ measures the $t$-dimensional accuracy of random number generators,
as discussed in the text above.) All arithmetic within this algorithm is done on
integers whose magnitudes rarely if ever exceed $m^2$, except in step S7; in fact,
nearly all of the integer variables will be less than $m$ in absolute value during
the computation.

When $\nu_t$ is being calculated for $t \ge 3$, the algorithm works with two $t \times t$
matrices $U$ and $V$, whose row vectors are denoted by $U_i = (u_{i1}, \ldots, u_{it})$ and
$V_i = (v_{i1}, \ldots, v_{it})$ for $1 \le i \le t$. These vectors satisfy the conditions

$$u_{i1} + au_{i2} + \cdots + a^{t-1}u_{it} \equiv 0 \ (\text{modulo } m), \qquad 1 \le i \le t; \tag{28}$$

$$U_i \cdot V_j = m\,\delta_{ij}, \qquad 1 \le i, j \le t. \tag{29}$$

(Thus the $V_j$ of our previous discussion have been multiplied by $m$, to ensure
that their components are integers.) There are three other auxiliary vectors,
$X = (x_1, \ldots, x_t)$, $Y = (y_1, \ldots, y_t)$, and $Z = (z_1, \ldots, z_t)$. During the entire
algorithm, $r$ will denote $a^{t-1} \bmod m$ and $s$ will denote the smallest upper bound
for $\nu_t^2$ that has been discovered so far.
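S1. [Initialize.] Set $t \leftarrow 2$, $h \leftarrow a$, $h' \leftarrow m$, $p \leftarrow 1$, $p' \leftarrow 0$, $r \leftarrow a$, $s \leftarrow 1 + a^2$.
(The first steps of this algorithm handle the case $t = 2$ by a special method,
very much like Euclid’s algorithm; we will have

$$h - ap \equiv h' - ap' \equiv 0 \ (\text{modulo } m) \quad\text{and}\quad hp' - h'p = \pm m \tag{30}$$

during this phase of the calculation.)
S2. [Euclidean step.] Set $q \leftarrow \lfloor h'/h \rfloor$, $u \leftarrow h' - qh$, $v \leftarrow p' - qp$. If $u^2 + v^2 < s$,
set $s \leftarrow u^2 + v^2$, $h' \leftarrow h$, $h \leftarrow u$, $p' \leftarrow p$, $p \leftarrow v$, and repeat step S2.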


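S3. [Compute $\nu_2$.] Set $u \leftarrow u - h$, $v \leftarrow v - p$; and if $u^2 + v^2 < s$, set $s \leftarrow u^2 + v^2$,
$h' \leftarrow u$, $p' \leftarrow v$. Then output $\sqrt{s} = \nu_2$. (The validity of this calculation for
the two-dimensional case is proved in exercise 5. Now we will set up the $U$
and $V$ matrices satisfying (28) and (29), in preparation for calculations in
higher dimensions.) Set

$$U \leftarrow \begin{pmatrix} -h & p \\ -h' & p' \end{pmatrix}, \qquad V \leftarrow \pm\begin{pmatrix} p' & h' \\ -p & -h \end{pmatrix},$$

where the $-$ sign is chosen for $V$ if and only if $p' > 0$.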


S4. [Advance $t$.] If $t = T$, the algorithm terminates. (Otherwise we want to
increase $t$ by 1. At this point $U$ and $V$ are $t \times t$ matrices satisfying (28)
and (29), and we must enlarge them by adding an appropriate new row
and column.) Set $t \leftarrow t + 1$ and $r \leftarrow (ar) \bmod m$. Set $U_t$ to the new row
$(-r, 0, 0, \ldots, 0, 1)$ of $t$ elements, and set $u_{it} \leftarrow 0$ for $1 \le i < t$. Set $V_t$ to the
new row $(0, 0, 0, \ldots, 0, m)$. Finally, for $1 \le i < t$, set $q \leftarrow \mathrm{round}(v_{i1}r/m)$,
$v_{it} \leftarrow v_{i1}r - qm$, and $U_t \leftarrow U_t + qU_i$. (Here “round($x$)” denotes the nearest
integer to $x$, e.g., $\lfloor x + 1/2 \rfloor$. We are essentially setting $v_{it} \leftarrow v_{i1}r$ and
immediately applying transformation (23) with $j = t$, since the numbers
$|v_{i1}r|$ are so large they ought to be reduced at once.) Finally set $s \leftarrow
\min(s, U_t \cdot U_t)$, $k \leftarrow t$, and $j \leftarrow 1$. (In the following steps, $j$ denotes the
current row index for transformation (23), and $k$ denotes the last such index
where the transformation shortened at least one of the $V_i$.)


S5. [Transform.] For $1 \le i \le t$, do the following operations: If $i \ne j$ and
$2\,|V_i \cdot V_j| > V_j \cdot V_j$, set $q \leftarrow \mathrm{round}(V_i \cdot V_j / V_j \cdot V_j)$, $V_i \leftarrow V_i - qV_j$, $U_j \leftarrow
U_j + qU_i$, $s \leftarrow \min(s, U_j \cdot U_j)$, and $k \leftarrow j$. (We omit the transformation
when $2\,|V_i \cdot V_j|$ exactly equals $V_j \cdot V_j$; exercise 19 shows that this precaution
keeps the algorithm from looping endlessly.)

S6. [Advance $j$.] If $j = t$, set $j \leftarrow 1$; otherwise set $j \leftarrow j + 1$. Now if $j \ne k$,
return to step S5. (If $j = k$, we have gone through $t - 1$ consecutive cycles
of no transformation, so the transformation process is stuck.)

S7. [Prepare for search.] (Now the absolute minimum will be determined,
using an exhaustive search over all $(x_1, \ldots, x_t)$ satisfying condition (21)
of Lemma A.) Set $X \leftarrow Y \leftarrow (0, \ldots, 0)$, set $k \leftarrow t$, and set

$$z_j \leftarrow \Bigl\lfloor \sqrt{\bigl\lfloor (V_j \cdot V_j)\, s / m^2 \bigr\rfloor} \Bigr\rfloor, \qquad \text{for } 1 \le j \le t. \tag{31}$$

(We will examine all $X = (x_1, \ldots, x_t)$ with $|x_j| \le z_j$ for $1 \le j \le t$. Usually
$|z_j| \le 1$, but L. C. Killingbeck noticed in 1999 that larger values occur
for about 0.00001 of all multipliers when $m = 2^{64}$. During the exhaustive
search, the vector $Y$ will always be equal to $x_1U_1 + \cdots + x_tU_t$, so that
$f(x_1, \ldots, x_t) = Y \cdot Y$. Since $f(-x_1, \ldots, -x_t) = f(x_1, \ldots, x_t)$, we shall
examine only vectors whose first nonzero component is positive. The method
is essentially that of counting in steps of one, regarding $(x_1, \ldots, x_t)$ as the
digits in a balanced number system with mixed radices $(2z_1 + 1, \ldots, 2z_t + 1)$;
see Section 4.1.)

S8. [Advance $x_k$.] If $x_k = z_k$, go to S10. Otherwise increase $x_k$ by 1 and set
$Y \leftarrow Y + U_k$.

S9. [Advance $k$.] Set $k \leftarrow k + 1$. Then if $k \le t$, set $x_k \leftarrow -z_k$, $Y \leftarrow Y - 2z_kU_k$,
and repeat step S9. But if $k > t$, set $s \leftarrow \min(s, Y \cdot Y)$.

S10. [Decrease $k$.] Set $k \leftarrow k - 1$. If $k \ge 1$, return to S8. Otherwise output
$\nu_t = \sqrt{s}$ (the exhaustive search is completed) and return to S4.


In practice Algorithm S is applied for $T = 5$ or 6, say; it usually works reasonably
well when $T = 7$ or 8, but it can be terribly slow when $T \ge 9$ since the exhaustive
search tends to make the running time grow as $3^T$. (If the minimum value $\nu_t$
occurs at many different points, the exhaustive search will hit them all; hence
we typically find that all $z_k = 1$ for large $t$. As remarked above, the values of $\nu_t$
are generally irrelevant for practical purposes when $t$ is large.)
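
A direct transcription of steps S1 to S10 into Python may help in following the control flow. This is a sketch rather than a tuned implementation: the names `spectral_test`, `round_div`, and `dot` are our own, indices are 0-based, the function returns the squares $\nu_t^2$ instead of the square roots, and Python's unbounded integers stand in for the high-precision arithmetic the algorithm assumes.

```python
from math import isqrt

def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

def round_div(n, d):
    """Nearest integer to n/d for d > 0, i.e. floor(n/d + 1/2), in exact arithmetic."""
    return (2 * n + d) // (2 * d)

def spectral_test(a, m, T):
    """Return [nu_2^2, nu_3^2, ..., nu_T^2] for multiplier a and modulus m."""
    # S1: initialise the two-dimensional, Euclid-like phase.
    h, h1, p, p1, r, s = a, m, 1, 0, a, 1 + a * a
    # S2: Euclidean steps, repeated while they keep shortening the vector.
    while True:
        q = h1 // h
        u, v = h1 - q * h, p1 - q * p
        if u * u + v * v >= s:
            break
        s, h1, h, p1, p = u * u + v * v, h, u, p, v
    # S3: one more candidate; then s is nu_2^2.
    u, v = u - h, v - p
    if u * u + v * v < s:
        s, h1, p1 = u * u + v * v, u, v
    results = [s]
    U = [[-h, p], [-h1, p1]]
    sign = -1 if p1 > 0 else 1
    V = [[sign * p1, sign * h1], [-sign * p, -sign * h]]
    for t in range(3, T + 1):
        # S4: enlarge U and V by one row and one column.
        r = (a * r) % m
        for row in U:
            row.append(0)
        new_u = [-r] + [0] * (t - 2) + [1]
        for i in range(t - 1):
            q = round_div(V[i][0] * r, m)
            V[i].append(V[i][0] * r - q * m)
            new_u = [x + q * y for x, y in zip(new_u, U[i])]
        U.append(new_u)
        V.append([0] * (t - 1) + [m])
        s = min(s, dot(new_u, new_u))
        k, j = t - 1, 0                      # 0-based versions of k = t, j = 1
        while True:
            # S5: transformation (23) with integer multipliers.
            for i in range(t):
                if i == j:
                    continue
                vij, vjj = dot(V[i], V[j]), dot(V[j], V[j])
                if 2 * abs(vij) > vjj:
                    q = round_div(vij, vjj)
                    V[i] = [x - q * y for x, y in zip(V[i], V[j])]
                    U[j] = [x + q * y for x, y in zip(U[j], U[i])]
                    s = min(s, dot(U[j], U[j]))
                    k = j
            # S6: advance j cyclically; stop when a full round changed nothing.
            j = (j + 1) % t
            if j == k:
                break
        # S7: search limits from Lemma A, Eq. (31).
        Z = [isqrt(dot(Vj, Vj) * s // (m * m)) for Vj in V]
        X, Y, k = [0] * t, [0] * t, t - 1
        while k >= 0:
            if X[k] == Z[k]:                 # S8 falls through to S10
                k -= 1
                continue
            X[k] += 1                        # S8
            Y = [y + u for y, u in zip(Y, U[k])]
            for k2 in range(k + 1, t):       # S9: reset the later digits to -z
                X[k2] = -Z[k2]
                Y = [y - 2 * Z[k2] * u for y, u in zip(Y, U[k2])]
            s = min(s, dot(Y, Y))
            k = t - 1                        # S10
        results.append(s)
    return results

if __name__ == "__main__":
    print(spectral_test(137, 256, 2))
```

Because the final exhaustive search is governed by Lemma A, small differences in how ties are rounded during step S5 affect only the running time, not the minimum that is reported.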

An example will help to make Algorithm S clear. Consider the linear
congruential sequence defined by

$$m = 10^{10}, \qquad a = 3141592621, \qquad c = 1, \qquad X_0 = 0. \tag{32}$$

Six cycles of the Euclidean algorithm in steps S2 and S3 suffice to prove that the
minimum nonzero value of $x_1^2 + x_2^2$ with

$$x_1 + 3141592621\,x_2 \equiv 0 \pmod{10^{10}}$$

occurs for $x_1 = 67654$, $x_2 = 226$; hence the two-dimensional accuracy of this
generator is

$$\nu_2 = \sqrt{67654^2 + 226^2} \approx 67654.37748.$$
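The two-dimensional minimum can be cross-checked independently. The sketch below does not follow steps S1 through S3; it swaps in the classical Lagrange-Gauss reduction of a two-dimensional lattice basis, applied to the basis $(m, 0)$, $(-a, 1)$ of all solutions to $x_1 + a x_2 \equiv 0$ (modulo $m$). The function name is hypothetical.

```python
def dot(p, q):
    return p[0] * q[0] + p[1] * q[1]


def shortest_vector_2d(a, m):
    """Shortest nonzero (x1, x2) with x1 + a*x2 = 0 (mod m), by Lagrange-Gauss reduction."""
    u, v = (m, 0), (-a, 1)              # a basis of the solution lattice
    if dot(u, u) > dot(v, v):
        u, v = v, u
    while True:
        uu = dot(u, u)
        q = (2 * dot(u, v) + uu) // (2 * uu)    # nearest integer to (u.v)/(u.u)
        v = (v[0] - q * u[0], v[1] - q * u[1])  # size-reduce v against u
        if dot(v, v) >= uu:
            return u                     # basis is reduced, so u is a shortest vector
        u, v = v, u


a, m = 3141592621, 10**10
x1, x2 = shortest_vector_2d(a, m)
nu2_squared = x1 * x1 + x2 * x2
print(x1, x2, nu2_squared, nu2_squared ** 0.5)
```

The sign of the returned vector is arbitrary, since $(x_1, x_2)$ and $(-x_1, -x_2)$ have the same length.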


Passing to three dimensions, we seek the minimum nonzero value of $x_1^2 + x_2^2 + x_3^2$
such that

$$x_1 + 3141592621\,x_2 + 3141592621^2\,x_3 \equiv 0 \pmod{10^{10}}. \tag{33}$$

Step S4 sets up the matrices

$$U = \begin{pmatrix} -67654 & -226 & 0 \\ -44190611 & 191 & 0 \\ 5793866 & 33 & 1 \end{pmatrix}, \qquad
V = \begin{pmatrix} -191 & -44190611 & 2564918569 \\ -226 & 67654 & 1307181134 \\ 0 & 0 & 10000000000 \end{pmatrix}.$$

The first iteration of step S5, with $q = 1$ for $i = 2$ and $q = 4$ for $i = 3$, changes
them to

$$U = \begin{pmatrix} -21082801 & 97 & 4 \\ -44190611 & 191 & 0 \\ 5793866 & 33 & 1 \end{pmatrix}, \qquad
V = \begin{pmatrix} -191 & -44190611 & 2564918569 \\ -35 & 44258265 & -1257737435 \\ 764 & 176762444 & -259674276 \end{pmatrix}.$$

(The first row $U_1$ has actually gotten longer in this transformation, although
eventually the rows of $U$ should get shorter.)

The next fourteen iterations of step S5 have $(j, q_1, q_2, q_3)$ = $(2, -2, *, 0)$,
$(3, 0, 3, *)$, $(1, *, -10, -1)$, $(2, -1, *, -6)$, $(3, -1, 0, *)$, $(1, *, 0, 2)$, $(2, 0, *, -1)$,
$(3, 3, 4, *)$, $(1, *, 0, 0)$, $(2, 5, *, 0)$, $(3, 1, 0, *)$, $(1, *, -3, -1)$, $(2, 0, *, 0)$, $(3, 0, 0, *)$.
Now the transformation process is stuck, but the rows of the matrices have
become significantly shorter:

$$U = \begin{pmatrix} -1479 & 616 & -2777 \\ -3022 & 104 & 918 \\ -227 & -983 & -130 \end{pmatrix}, \qquad
V = \begin{pmatrix} -888874 & 601246 & -2994234 \\ -2809871 & 438109 & 1593689 \\ -854296 & -9749816 & -1707736 \end{pmatrix}. \tag{34}$$

The search limits $(z_1, z_2, z_3)$ in step S7 turn out to be $(0, 0, 1)$, so $U_3$ is the
shortest solution to (33); we have

$$\nu_3 = \sqrt{227^2 + 983^2 + 130^2} \approx 1017.21089.$$


Only a few iterations were needed to find this value, although condition (33)
looks quite formidable at first glance. Our computation has proved that all
points $(U_n, U_{n+1}, U_{n+2})$ produced by the random number generator (32) lie on a
family of parallel planes about 0.001 units apart, but not on any family of planes
that differ by more than 0.001 units.
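As a small arithmetic check of the three-dimensional result, the following sketch confirms that the row $U_3 = (-227, -983, -130)$ satisfies congruence (33), and computes $\nu_3$ together with the spacing $1/\nu_3$ between the parallel planes. The variable names are illustrative only.

```python
a, m = 3141592621, 10**10
x = (-227, -983, -130)                 # row U_3 of the final matrix U

# Left-hand side of (33), reduced modulo m; it should vanish.
residue = (x[0] + a * x[1] + a * a * x[2]) % m

nu3_squared = sum(c * c for c in x)
nu3 = nu3_squared ** 0.5
plane_spacing = 1 / nu3                # distance between neighboring planes

print(residue, nu3_squared, nu3, plane_spacing)
```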

The exhaustive search in steps S8–S10 reduces the value of $s$ only rarely.
One such case, found in 1982 by R. Carling and K. Levine, occurs when $a =
464680339$, $m = 2^{29}$, and $t = 5$; another case arose when the author calculated
$\nu_6$ for line 21 of Table 1, later in this section.


E. Ratings for various generators. So far we haven’t really given a criterion
that tells us whether or not a particular random number generator passes or
flunks the spectral test. In fact, spectral success depends on the application,
since some applications demand higher resolution than others. It appears that
$\nu_t \ge 2^{30/t}$ for $2 \le t \le 6$ will be quite adequate for most purposes (although
the author must admit choosing this criterion partly because 30 is conveniently
divisible by 2, 3, 5, and 6).

For some purposes we would like a criterion that is relatively independent
of $m$, so we can say that a particular multiplier is good or bad with respect to
the set of all other multipliers for the given $m$, without examining any others.
A reasonable figure of merit for rating the goodness of a particular multiplier
seems to be the volume of the ellipsoid in $t$-space defined by the relation

$$(x_1 m - x_2 a - \cdots - x_t a^{t-1})^2 + x_2^2 + \cdots + x_t^2 \le \nu_t^2,$$

since this volume tends to indicate how likely it is that nonzero integer points
$(x_1, \ldots, x_t)$, corresponding to solutions of (15), are in the ellipsoid. We
therefore propose to calculate this volume, namely

$$\mu_t = \frac{\pi^{t/2}\,\nu_t^{\,t}}{(t/2)!\;m}, \tag{35}$$

as an indication of the effectiveness of the multiplier $a$ for the given $m$. In this
formula,

$$\Bigl(\frac{t}{2}\Bigr)! = \Bigl(\frac{t}{2}\Bigr)\Bigl(\frac{t}{2} - 1\Bigr) \cdots \Bigl(\frac{1}{2}\Bigr)\sqrt{\pi}, \qquad \text{for $t$ odd.} \tag{36}$$

Thus, in six or fewer dimensions the merit is computed as follows:

$$\mu_2 = \pi\nu_2^2/m, \qquad \mu_3 = \tfrac{4}{3}\pi\nu_3^3/m, \qquad \mu_4 = \tfrac{1}{2}\pi^2\nu_4^4/m,$$

$$\mu_5 = \tfrac{8}{15}\pi^2\nu_5^5/m, \qquad \mu_6 = \tfrac{1}{6}\pi^3\nu_6^6/m.$$

We might say that the multiplier $a$ passes the spectral test if $\mu_t$ is 0.1 or more
for $2 \le t \le 6$, and it “passes with flying colors” if $\mu_t \ge 1$ for all these $t$. A low
value of $\mu_t$ means that we have probably picked a very unfortunate multiplier,
since very few lattices will have integer points so close to the origin. Conversely,
a high value of $\mu_t$ means that we have found an unusually good multiplier for
the given $m$; but it does not mean that the random numbers are necessarily very
good, since $m$ might be too small. Only the values $\nu_t$ truly indicate the degree
of randomness.
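The merit formula (35) is easy to evaluate once the values $\nu_t^2$ are known. The sketch below is a minimal illustration, assuming that $(t/2)!$ is computed as $\Gamma(t/2 + 1)$, which agrees with (36) for odd $t$; the helper names `merit` and `rating` are hypothetical, and the input data are the $\nu_t^2$ values listed for lines 4 and 5 of Table 1.

```python
from math import gamma, pi


def merit(t, nu_squared, m):
    """mu_t = pi^(t/2) * nu_t^t / ((t/2)! * m), with (t/2)! taken as Gamma(t/2 + 1)."""
    return pi ** (t / 2) * nu_squared ** (t / 2) / (gamma(t / 2 + 1) * m)


def rating(mus):
    """Apply the suggested thresholds 0.1 and 1 to mu_2, ..., mu_6."""
    if all(mu >= 1 for mu in mus):
        return "passes with flying colors"
    if all(mu >= 0.1 for mu in mus):
        return "passes"
    return "fails"


# nu_t^2 for t = 2, ..., 6, as listed in Table 1.
line5 = {"m": 256, "nu_sq": [274, 30, 14, 6, 4]}                          # a = 137
line4 = {"m": 2**35, "nu_sq": [2997222016, 1026050, 27822, 1118, 1118]}   # a = 3141592653

mus5 = [merit(t, v, line5["m"]) for t, v in zip(range(2, 7), line5["nu_sq"])]
mus4 = [merit(t, v, line4["m"]) for t, v in zip(range(2, 7), line4["nu_sq"])]
print([round(mu, 2) for mu in mus5], rating(mus5))
print([round(mu, 2) for mu in mus4], rating(mus4))
```

A high rating for line 5 illustrates the caveat in the text: the multiplier is good for its modulus, yet $m = 256$ is far too small for the numbers to be useful.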

Table 1 shows what sorts of values occur in typical sequences. Each line of
the table considers a particular generator, and lists $\nu_t^2$, $\mu_t$, and the “number of
bits of accuracy” $\lg \nu_t$. Lines 1 through 4 show the generators that were the
subject of Figs. 2 and 5 in Section 3.3.1. The generators in lines 1 and 2 suffer from
too small a multiplier; a diagram like Fig. 8 will have nearly vertical “stripes”
when $a$ is small. The terrible generator in line 3 has a good $\mu_2$ but very poor $\mu_3$
and $\mu_4$; like nearly all generators of potency 2, it has $\nu_3 = \sqrt{6}$ and $\nu_4 = 2$ (see
exercise 3). Line 4 shows a “random” multiplier; this generator has satisfactorily
passed numerous empirical tests for randomness, but it does not have especially
high values of $\mu_2, \ldots, \mu_6$. In fact, the value of $\mu_5$ flunks our criterion.

Line 5 shows the generator of Fig. 8. It passes the spectral test with very
high-flying colors, when $\mu_2$ through $\mu_6$ are considered, but of course $m$ is so small
that the numbers can hardly be called random; the $\nu_t$ values are terribly low.

Table 1
SAMPLE RESULTS OF THE SPECTRAL TEST

Line  a                 m                     ν_2^2          ν_3^2        ν_4^2      ν_5^2     ν_6^2
  1   23                10^8+1                530            530          530        530       447
  2   2^7+1             2^35                  16642          16642        16642      15602     252
  3   2^18+1            2^35                  34359738368    6            4          4         4
  4   3141592653        2^35                  2997222016     1026050      27822      1118      1118
  5   137               256                   274            30           14         6         4
  6   3141592621        10^10                 4577114792     1034718      62454      1776      542
  7   3141592221        10^10                 4293881050     276266       97450      3366      2382
  8   4219755981        10^10                 10721093248    2595578      49362      5868      820
  9   4160984121        10^10                 9183801602     4615650      16686      6840      1344
 10   2^24+2^13+5       2^35                  8364058        8364058      21476      16712     1496
 11   5^13              2^35                  33161885770    2925242      113374     13070     2256
 12   2^16+3            2^29                  536936458      118          116        116       116
 13   1812433253        2^32                  4326934538     1462856      15082      4866      906
 14   1566083941        2^32                  4659748970     2079590      44902      4652      662
 15   69069             2^32                  4243209856     2072544      52804      6990      242
 16   2650845021        2^32                  4938969760     2646962      68342      8778      1506
 17   314159269         2^31-1                1432232969     899290       36985      3427      1144
 18   62089911          2^31-1                1977289717     1662317      48191      6101      1462
 19   16807             2^31-1                282475250      408197       21682      4439      895
 20   48271             2^31-1                1990735345     1433881      47418      4404      1402
 21   40692             2^31-249              1655838865     1403422      42475      6507      1438
 22   44485709377909    2^46                  5.6×10^13      1180915002   1882426    279928    26230
 23   31167285          2^48                  3.2×10^14      4111841446   17341510   306326    59278
 24   see (38)          (2^31-1)(2^31-249)    2.4×10^18      4.7×10^11    1.9×10^9   3194548   1611610
 25   see (39)          (2^31-1)^2            (2^31-1)^2     1.4×10^12    643578623  12930027  837632
 26   see the text      2^64                  8.8×10^18      6.4×10^12    4.1×10^9   45662836  1846368
 27   see the text      ≈ 2^78                2^62+1         4281084902   2.2×10^9   1.8×10^9  1862407
 28   2^{-24·389}       ≈ 2^576               1.8×10^173     3.5×10^115   4.4×10^86  2×10^69   5×10^57
 29   (2^32-5)^{-400}   ≈ 2^1376              1.6×10^414     8.6×10^275   1×10^207   2×10^165  8×10^137


Line 6 is the generator discussed in (32) above. Line 7 is a similar example,
having an abnormally low value of $\mu_3$. Line 8 shows a nonrandom multiplier
for the same modulus $m$; all of its partial quotients are 1, 2, or 3. Such
multipliers have been suggested by I. Borosh and H. Niederreiter because the
Dedekind sums are likely to be especially small and because they produce best
results in the two-dimensional serial test (see Section 3.3.3 and exercise 30). The
particular example in line 8 has only one ‘3’ as a partial quotient; there is no
multiplier congruent to 1 modulo 20 whose partial quotients with respect to $10^{10}$
are only 1s and 2s. The generator in line 9 shows another multiplier chosen with
malice aforethought, following a suggestion by A. G. Waterman that guarantees
a reasonably high value of $\mu_2$ (see exercise 11). Line 10 is interesting because it
has high $\mu_3$ in spite of very low $\mu_2$ (see exercise 8).

Table 1 (continued)

Line  lg ν_2  lg ν_3  lg ν_4  lg ν_5  lg ν_6     μ_2     μ_3     μ_4     μ_5     μ_6
  1     4.5     4.5     4.5     4.5     4.4     2e-5    5e-4    0.01    0.34    4.62
  2     7.0     7.0     7.0     7.0     4.0     2e-6    3e-4    0.04    4.66    2e-3
  3    17.5     1.3     1.0     1.0     1.0     3.14    2e-9    2e-9    5e-9    1e-8
  4    15.7    10.0     7.4     5.1     5.1     0.27    0.13    0.11    0.01    0.21
  5     4.0     2.5     1.9     1.3     1.0     3.36    2.69    3.78    1.81    1.29
  6    16.0    10.0     8.0     5.4     4.5     1.44    0.44    1.92    0.07    0.08
  7    16.0     9.0     8.3     5.9     5.6     1.35    0.06    4.69    0.35    6.98
  8    16.7    10.7     7.8     6.3     4.8     3.37    1.75    1.20    1.39    0.28
  9    16.5    11.1     7.0     6.4     5.2     2.89    4.15    0.14    2.04    1.25
 10    11.5    11.5     7.2     7.0     5.3     8e-4    2.95    0.07    5.53    0.50
 11    17.5    10.7     8.4     6.8     5.6     3.03    0.61    1.85    2.99    1.73
 12    14.5     3.4     3.4     3.4     3.4     3.14    1e-5    1e-4    1e-3    0.02
 13    16.0    10.2     6.9     6.1     4.9     3.16    1.73    0.26    2.02    0.89
 14    16.1    10.5     7.7     6.1     4.7     3.41    2.92    2.32    1.81    0.35
 15    16.0    10.5     7.8     6.4     4.0     3.10    2.91    3.20    5.01    0.02
 16    16.1    10.7     8.0     6.6     5.3     3.61    4.20    5.37    8.85    4.11
 17    15.2     9.9     7.6     5.9     5.1     2.10    1.66    3.14    1.69    3.60
 18    15.4    10.3     7.8     6.3     5.3     2.89    4.18    5.34    7.13    7.52
 19    14.0     9.3     7.2     6.1     4.9     0.41    0.51    1.08    3.22    1.73
 20    15.4    10.2     7.8     6.1     5.2     2.91    3.35    5.17    3.15    6.63
 21    15.3    10.2     7.7     6.3     5.2     2.42    3.24    4.15    8.37    7.16
 22    22.8    15.1    10.4     9.0     7.3     2.48    2.42    0.25    3.10    1.33
 23    24.1    16.0    12.0     9.1     7.9     3.60    3.92    5.27    0.97    3.82
 24    30.5    19.4    15.4    10.8    10.3     1.65    0.29    3.88    0.02    4.69
 25    31.0    20.2    14.6    11.8     9.8     3.14    1.49    0.44    0.69    0.66
 26    31.5    21.3    16.0    12.7    10.4     1.50    3.68    4.52    4.02    1.76
 27    31.0    16.0    15.5    15.4    10.4     5e-5    4e-9    8e-5    2.56    1e-4
 28    288.    192.    144.    115.    95.9     2.27    3.46    3.92    2.49    2.98
 29    688.    458.    344.    275.    229.     3.10    2.04    2.85    1.15    1.33

upper bounds from (40):                         3.63    5.92    9.87   14.89   23.87

Line 11 of Table 1 is a reminder of the good old days; it once was used
extensively, following a suggestion of O. Taussky in the early 1950s. But computers
for which $2^{35}$ was an appropriate modulus began to fade in importance during
the late 60s, and they disappeared almost completely in the 80s, as machines


with 32-bit arithmetic began to proliferate. This switch to a comparatively small
word size called for comparatively greater care. Line 12 was, alas, the generator
actually used on such machines in most of the world’s scientific computing centers
for more than a decade; its very name RANDU is enough to bring dismay into the
eyes and stomachs of many computer scientists! The actual generator is defined
by

$$X_0 \text{ odd}, \qquad X_{n+1} = (65539\,X_n) \bmod 2^{31}, \tag{37}$$

and exercise 20 indicates that $2^{29}$ is the appropriate modulus for the spectral
test. Since $9X_n - 6X_{n+1} + X_{n+2} \equiv 0$ (modulo $2^{31}$), the generator fails most
three-dimensional criteria for randomness, and it should never have been used.
Almost any multiplier $\equiv 5$ (modulo 8) would be better. (A curious fact about
RANDU, noticed by R. W. Gosper, is that $\nu_4 = \nu_5 = \nu_6 = \nu_7 = \nu_8 = \nu_9 = \sqrt{116}$,
hence $\mu_9$ is a spectacular 11.98.) Lines 13 and 14 are the Borosh–Niederreiter and
Waterman multipliers for modulus $2^{32}$. Line 16 was found by L. C. Killingbeck,
who carried out an exhaustive search of all multipliers $a \equiv 1$ (modulo 4) when
$m = 2^{32}$. Line 23, similarly, was found by M. Lavaux and F. Janssens in a
(nonexhaustive) computer search for spectrally good multipliers having a very
high $\mu_2$. Line 22 is for the multiplier used with $c = 0$ and $m = 2^{48}$ in the Cray
X-MP library; line 26 (whose excellent multiplier 6364136223846793005 is too big
to fit in the column) is due to C. E. Haynes. Line 15 was nominated by George
Marsaglia as “a candidate for the best of all multipliers,” after a computer search
for nearly cubical lattices in dimensions 2 through 5, partly because it is easy
to remember [Applications of Number Theory to Numerical Analysis, edited by
S. K. Zaremba (New York: Academic Press, 1972), 275].
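The three-term relation that dooms RANDU follows from $65539^2 = (2^{16} + 3)^2 \equiv 6 \cdot 65539 - 9$ (modulo $2^{31}$), and it can be observed directly. The following sketch generates the sequence (37) from an arbitrarily chosen odd seed and counts the triples that violate $9X_n - 6X_{n+1} + X_{n+2} \equiv 0$; the function name and the seed are illustrative assumptions.

```python
def randu(x0, count):
    """Return `count` consecutive values of X_{n+1} = 65539 * X_n mod 2^31, starting at x0."""
    xs, x = [], x0
    for _ in range(count):
        xs.append(x)
        x = (65539 * x) % 2**31
    return xs


xs = randu(1, 1000)                      # X_0 = 1 is odd, as (37) requires
violations = sum(
    (9 * xs[n] - 6 * xs[n + 1] + xs[n + 2]) % 2**31 != 0
    for n in range(len(xs) - 2)
)
print(violations)
```

Every triple of consecutive outputs therefore lies on one of a small number of planes, which is what the tiny value of $\nu_3$ in line 12 expresses.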

Line 17 uses a random primitive root, modulo the prime $2^{31} - 1$, as multiplier.
Line 18 shows the spectrally best primitive root for $2^{31} - 1$, found in an exhaustive
search by G. S. Fishman and L. R. Moore III [SIAM J. Sci. Stat. Comput. 7
(1986), 24–45]. The adequate but less outstanding multiplier $16807 = 7^5$ in
line 19 is actually used most often for that modulus, after being proposed by
Lewis, Goodman, and Miller in IBM Systems J. 8 (1969), 136–146; it has been
one of the main generators in the popular IMSL subroutine library since 1971.
The main reason for continued use of $a = 16807$ is that $a^2$ is less than the
modulus $m$, hence $ax \bmod m$ can be implemented with reasonable efficiency in
high-level languages using the technique of exercise 3.2.1.1–9. However, such
small multipliers have known defects. S. K. Park and K. W. Miller noticed that
the same implementation technique applies also to certain multipliers greater
than $\sqrt{m}$, so they asked G. S. Fishman to find the best “efficiently portable”
multiplier in this wider class; the result appears in line 20 [CACM 31 (1988),
1192–1201]. Line 21 shows another good multiplier, due to P. L’Ecuyer [CACM
31 (1988), 742–749, 774]; this one uses a slightly smaller prime modulus.
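The portable multiplication trick can be sketched as follows. This is an illustration of the decomposition commonly known as Schrage's method, which is assumed here to be the technique the cited exercise has in mind: write $m = aq + r$ with $q = \lfloor m/a \rfloor$ and $r = m \bmod a$; when $r < q$, the product $ax \bmod m$ can be formed without any intermediate value reaching $m$ in magnitude. The function name is hypothetical. Python integers never overflow, so the point of the sketch is the arithmetic identity, not the overflow avoidance itself.

```python
def mul_mod_portable(a, x, m):
    """Compute (a * x) mod m for 0 <= x < m using only quantities smaller than m in magnitude.

    Requires r < q, where m = a*q + r; this holds whenever a*a < m,
    and also for certain larger multipliers such as a = 48271 with m = 2^31 - 1.
    """
    q, r = divmod(m, a)
    if r >= q:
        raise ValueError("decomposition m = a*q + r needs r < q")
    t = a * (x % q) - r * (x // q)     # both products are below m
    return t if t >= 0 else t + m


m = 2**31 - 1
samples = [1, 2, 16807, 48271, 123456789, m - 1]
results = {a: [mul_mod_portable(a, x, m) for x in samples] for a in (16807, 48271)}
print(results)
```

The identity behind it is $ax = a(x \bmod q) + (m - r)\lfloor x/q \rfloor$, so that $ax \equiv a(x \bmod q) - r\lfloor x/q \rfloor$ (modulo $m$).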

When the generators of lines 20 and 21 are combined by subtraction as
suggested in Eq. 3.2.2–(15), so that the generated numbers $\langle Z_n \rangle$ satisfy

$$X_{n+1} = 48271\,X_n \bmod (2^{31} - 1), \qquad Y_{n+1} = 40692\,Y_n \bmod (2^{31} - 249),$$

$$Z_n = (X_n - Y_n) \bmod (2^{31} - 1), \tag{38}$$

exercise 32 shows that it is reasonable to rate $\langle Z_n \rangle$ with the spectral test for
$m = (2^{31} - 1)(2^{31} - 249)$ and $a = 1431853894371298687$. (This value of $a$ satisfies
$a \bmod (2^{31} - 1) = 48271$ and $a \bmod (2^{31} - 249) = 40692$.) The results appear on
line 24. We needn’t worry too much about the low value of $\mu_5$, since $\nu_5 > 1000$.
Generator (38) has a period of length $(2^{31} - 2)(2^{31} - 250)/62 \approx 7 \times 10^{16}$.
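A combined multiplier with the two stated residues can be obtained from the Chinese remainder theorem. The sketch below is a hedged illustration: it builds the unique $a$ in the range $0 \le a < (2^{31} - 1)(2^{31} - 249)$ with the required residues, and evaluates the period as the least common multiple of $2^{31} - 2$ and $2^{31} - 250$, which assumes that both component generators have maximum period. It relies on `pow(x, -1, n)` for modular inverses, available in Python 3.8 and later; the result can be compared with the value of $a$ quoted in the text.

```python
from math import gcd

m1, m2 = 2**31 - 1, 2**31 - 249
a1, a2 = 48271, 40692

# Chinese remainder theorem: a = a1 + m1*k, with k chosen so that a = a2 (mod m2).
k = ((a2 - a1) * pow(m1, -1, m2)) % m2
a = a1 + m1 * k

# Period of the combination, assuming each component has period m_i - 1.
g = gcd(m1 - 1, m2 - 1)
period = (m1 - 1) * (m2 - 1) // g

print(a, g, period)
```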

Line 25 of the table represents the sequence

$$X_n = (271828183\,X_{n-1} - 314159269\,X_{n-2}) \bmod (2^{31} - 1), \tag{39}$$

which can be shown to have period length $(2^{31} - 1)^2 - 1$; it has been analyzed
with the generalized spectral test of exercise 24.

The last three lines of Table 1 are based on add-with-carry and
subtract-with-borrow methods, which simulate linear congruential sequences that have
extremely large moduli (see exercise 3.2.1.1–14). Line 27 is for the generator

$$X_n = (X_{n-1} + 65430\,X_{n-2} + C_n) \bmod 2^{31},$$
$$C_{n+1} = \lfloor (X_{n-1} + 65430\,X_{n-2} + C_n)/2^{31} \rfloor,$$

which corresponds to $\mathcal{X}_{n+1} = (65430 \cdot 2^{31} + 1)\,\mathcal{X}_n \bmod (65430 \cdot 2^{62} + 2^{31} - 1)$; the
numbers in the table refer to the “super-values”

$$\mathcal{X}_n = (65430 \cdot 2^{31} + 1)\,X_{n-1} + 65430\,X_{n-2} + C_n$$

rather than to the values $X_n$ actually computed and used as random numbers.
Line 28 represents a more typical subtract-with-borrow generator

$$X_n = (X_{n-10} - X_{n-24} - C_n) \bmod 2^{24}, \qquad C_{n+1} = [X_{n-10} < X_{n-24} + C_n],$$

but modified by generating 389 elements of the sequence and then using only the
first (or last) 24. This generator, called RANLUX, was recommended by Martin
Lüscher after it passed many stringent tests that previous generators failed
[Computer Physics Communications 79 (1994), 100–110]. A similar sequence,

$$X_n = (X_{n-22} - X_{n-43} - C_n) \bmod (2^{32} - 5), \qquad C_{n+1} = [X_{n-22} < X_{n-43} + C_n],$$

with 43 elements used after 400 are generated, appears in line 29; this sequence is
discussed in the answer to exercise 3.2.1.2–22. In both cases the table entries refer
to the spectral test on multiprecision numbers $\mathcal{X}_n$ instead of to the individual
“digits” $X_n$, but the high $\mu$ values indicate that the process of generating 389 or
400 numbers before selecting 24 or 43 is an excellent way to remove biases due
to the extreme simplicity of the generation scheme.
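The correspondence between the add-with-carry recurrence of line 27 and a linear congruential sequence on the super-values can be checked numerically. The sketch below steps the recurrence from an arbitrary illustrative seed and verifies that each super-value is congruent to $(65430 \cdot 2^{31} + 1)$ times its predecessor, modulo $65430 \cdot 2^{62} + 2^{31} - 1$. The function names and the seed are assumptions made for illustration.

```python
B = 2**31
K = 65430
A = K * B + 1                  # multiplier of the simulated congruential sequence
M = K * B * B + B - 1          # its modulus, 65430 * 2^62 + 2^31 - 1


def step(x1, x2, c):
    """One add-with-carry step: (X_{n-1}, X_{n-2}, C_n) -> (X_n, X_{n-1}, C_{n+1})."""
    total = x1 + K * x2 + c
    return total % B, x1, total // B


def super_value(x1, x2, c):
    """Super-value formed from X_{n-1}, X_{n-2}, and C_n."""
    return A * x1 + K * x2 + c


state = (123456789, 987654321, 0)      # arbitrary seed with both words below 2^31
mismatches = 0
in_range = True
for _ in range(1000):
    nxt = step(*state)
    if (A * super_value(*state) - super_value(*nxt)) % M != 0:
        mismatches += 1
    in_range = in_range and 0 <= nxt[0] < B and 0 <= nxt[2] <= K
    state = nxt
print(mismatches, in_range)
```

The congruence holds because $65430 \cdot 2^{62} \equiv 1 - 2^{31}$ modulo $M$, which turns the carry propagation into an ordinary multiplication.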

Theoretical upper bounds on $\mu_t$, which can never be transcended for any $m$,
are shown just below Table 1; it is known that every lattice with $m$ points per
unit volume has

$$\nu_t \le \gamma_t^{1/2}\, m^{1/t}, \tag{40}$$

where $\gamma_t$ takes the respective values

$$(4/3)^{1/2}, \quad 2^{1/3}, \quad 2^{1/2}, \quad 2^{3/5}, \quad (64/3)^{1/6}, \quad 4^{3/7}, \quad 2 \tag{41}$$

for $t = 2, \ldots, 8$. [See exercise 9 and J. W. S. Cassels, Introduction to the
Geometry of Numbers (Berlin: Springer, 1959), 332; J. H. Conway and N. J. A.
Sloane, Sphere Packings, Lattices and Groups (New York: Springer, 1988),
20.] These bounds hold for lattices generated by vectors with arbitrary real
coordinates. For example, the optimum lattice for $t = 2$ is hexagonal, and it
is generated by vectors of length $\sqrt{2/(\sqrt{3}\,m)}$ that form two sides of an equilateral
triangle. In three dimensions the optimum lattice is generated by vectors $V_1$,
$V_2$, $V_3$ that can be rotated into the form $(v, v, -v)$, $(v, -v, v)$, $(-v, v, v)$, where
$v = 1/\sqrt[3]{4m}$.
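The row of upper bounds printed beneath Table 1 follows from substituting (40) into the merit formula (35): the modulus cancels, leaving $\pi^{t/2} \gamma_t^{t/2} / (t/2)!$. The sketch below evaluates this expression for the constants listed in (41), again assuming that $(t/2)!$ is taken as $\Gamma(t/2 + 1)$; the names are hypothetical.

```python
from math import gamma, pi

# The constants gamma_t of (41), for t = 2, ..., 8.
hermite = {
    2: (4 / 3) ** (1 / 2),
    3: 2 ** (1 / 3),
    4: 2 ** (1 / 2),
    5: 2 ** (3 / 5),
    6: (64 / 3) ** (1 / 6),
    7: 4 ** (3 / 7),
    8: 2.0,
}


def mu_upper_bound(t):
    """Largest possible mu_t: put nu_t = gamma_t^(1/2) m^(1/t) into (35); m cancels."""
    return pi ** (t / 2) * hermite[t] ** (t / 2) / gamma(t / 2 + 1)


bounds = [mu_upper_bound(t) for t in range(2, 7)]
print([round(b, 2) for b in bounds])
```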


*F. Relation to the serial test. In a series of important papers published 
during the 1970s, Harald Niederreiter showed how to analyze the distribution of 
the t-dimensional vectors (1) by means of exponential sums. One of the main 
consequences of his theory is that the serial test in several dimensions will be 
passed by any generator that passes the spectral test, even when we consider 
only a sufficiently large part of the period instead of the whole period. We 
shall now turn briefly to a study of his interesting methods, in the case of linear 
congruential sequences $(X_0, a, c, m)$ of period length $m$.

The first idea we need is the notion of discrepancy in $t$ dimensions, a
quantity that we shall define as the difference between the expected number
and the actual number of $t$-dimensional vectors $(x_n, x_{n+1}, \ldots, x_{n+t-1})$ falling
into a hyper-rectangular region, maximized over all such regions. To be precise,
let $\langle x_n \rangle$ be a sequence of integers in the range $0 \le x_n < m$. We define

$$D_N^{(t)} = \max_R \left| \frac{\text{number of } (x_n, \ldots, x_{n+t-1}) \text{ in } R \text{ for } 0 \le n < N}{N} - \frac{\text{volume of } R}{m^t} \right| \tag{42}$$

where $R$ ranges over all sets of points of the form

$$R = \{ (y_1, \ldots, y_t) \mid \alpha_1 \le y_1 < \beta_1,\ \ldots,\ \alpha_t \le y_t < \beta_t \}; \tag{43}$$

here $\alpha_j$ and $\beta_j$ are integers in the range $0 \le \alpha_j < \beta_j \le m$, for $1 \le j \le t$. The
volume of $R$ is clearly $(\beta_1 - \alpha_1) \ldots (\beta_t - \alpha_t)$. To get the discrepancy $D_N^{(t)}$, we
imagine looking at all these sets $R$ and finding the one with the greatest excess
or deficiency of points $(x_n, \ldots, x_{n+t-1})$.
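For very small moduli, definition (42) can be evaluated directly by trying every box of the form (43). The brute-force sketch below is meant only to make the definition concrete; its running time grows like $m^{2t}$, so it is not a practical test. Exact rational arithmetic is used so that a discrepancy of zero is reported exactly. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def discrepancy(xs, m, t, N):
    """Brute-force D_N^(t) of (42), maximizing over all boxes (43). Needs len(xs) >= N + t - 1."""
    points = [tuple(xs[n + i] for i in range(t)) for n in range(N)]
    intervals = [(lo, hi) for lo in range(m) for hi in range(lo + 1, m + 1)]
    worst = Fraction(0)
    for box in product(intervals, repeat=t):
        inside = sum(
            all(lo <= y < hi for y, (lo, hi) in zip(p, box)) for p in points
        )
        volume = 1
        for lo, hi in box:
            volume *= hi - lo
        worst = max(worst, abs(Fraction(inside, N) - Fraction(volume, m ** t)))
    return worst


def lcg(x0, a, c, m, count):
    """First `count` terms of the linear congruential sequence (x0, a, c, m)."""
    xs, x = [], x0
    for _ in range(count):
        xs.append(x)
        x = (a * x + c) % m
    return xs


d_uniform = discrepancy(list(range(8)), 8, 1, 8)     # every residue exactly once
d_constant = discrepancy([0] * 8, 8, 1, 8)           # all points coincide
d_lcg = discrepancy(lcg(0, 5, 1, 16, 17), 16, 2, 16)  # full period, pairs (x_n, x_{n+1})
print(d_uniform, d_constant, d_lcg)
```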
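An upper bound for the discrepancy can be found by using exponential sums.
Let $\omega = e^{2\pi i/m}$ be a primitive $m$th root of unity. If $(x_1, \ldots, x_t)$ and $(y_1, \ldots, y_t)$
are two vectors with all components in the range $0 \le x_j, y_j < m$, we have

$$\sum_{0 \le u_1, \ldots, u_t < m} \omega^{(x_1 - y_1)u_1 + \cdots + (x_t - y_t)u_t} =
\begin{cases} m^t & \text{if } (x_1, \ldots, x_t) = (y_1, \ldots, y_t); \\ 0 & \text{if } (x_1, \ldots, x_t) \ne (y_1, \ldots, y_t). \end{cases}$$

Therefore the number of vectors $(x_n, \ldots, x_{n+t-1})$ in $R$ for $0 \le n < N$, when $R$
is defined by (43), can be expressed as

$$\frac{1}{m^t} \sum_{0 \le n < N}\ \sum_{0 \le u_1, \ldots, u_t < m} \omega^{x_n u_1 + \cdots + x_{n+t-1} u_t}
\sum_{\alpha_1 \le y_1 < \beta_1} \cdots \sum_{\alpha_t \le y_t < \beta_t} \omega^{-(y_1 u_1 + \cdots + y_t u_t)}.$$

When $u_1 = \cdots = u_t = 0$ in this sum, we get $N/m^t$ times the volume of $R$; hence
we can express $D_N^{(t)}$ as the maximum over $R$ of

$$\left| \frac{1}{N m^t} \sum_{0 \le n < N}\ \sum_{\substack{0 \le u_1, \ldots, u_t < m \\ (u_1, \ldots, u_t) \ne (0, \ldots, 0)}} \omega^{x_n u_1 + \cdots + x_{n+t-1} u_t}
\sum_{\alpha_1 \le y_1 < \beta_1} \cdots \sum_{\alpha_t \le y_t < \beta_t} \omega^{-(y_1 u_1 + \cdots + y_t u_t)} \right|.$$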
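Since complex numbers satisfy $|w + z| \le |w| + |z|$ and $|wz| = |w||z|$, it follows
that

$$\begin{aligned}
D_N^{(t)} &\le \max_R \sum_{\substack{0 \le u_1, \ldots, u_t < m \\ (u_1, \ldots, u_t) \ne (0, \ldots, 0)}}
\left| \frac{1}{m^t} \sum_{\alpha_1 \le y_1 < \beta_1} \cdots \sum_{\alpha_t \le y_t < \beta_t} \omega^{-(y_1 u_1 + \cdots + y_t u_t)} \right| g(u_1, \ldots, u_t) \\
&\le \sum_{\substack{0 \le u_1, \ldots, u_t < m \\ (u_1, \ldots, u_t) \ne (0, \ldots, 0)}}
\max_R \left| \frac{1}{m^t} \sum_{\alpha_1 \le y_1 < \beta_1} \cdots \sum_{\alpha_t \le y_t < \beta_t} \omega^{-(y_1 u_1 + \cdots + y_t u_t)} \right| g(u_1, \ldots, u_t) \\
&= \sum_{\substack{0 \le u_1, \ldots, u_t < m \\ (u_1, \ldots, u_t) \ne (0, \ldots, 0)}} f(u_1, \ldots, u_t)\, g(u_1, \ldots, u_t),
\end{aligned} \tag{44}$$

where

$$g(u_1, \ldots, u_t) = \left| \frac{1}{N} \sum_{0 \le n < N} \omega^{x_n u_1 + \cdots + x_{n+t-1} u_t} \right|;$$

$$\begin{aligned}
f(u_1, \ldots, u_t) &= \max_R \left| \frac{1}{m^t} \sum_{\alpha_1 \le y_1 < \beta_1} \cdots \sum_{\alpha_t \le y_t < \beta_t} \omega^{-(y_1 u_1 + \cdots + y_t u_t)} \right| \\
&= \max_R \left| \frac{1}{m} \sum_{\alpha_1 \le y_1 < \beta_1} \omega^{-u_1 y_1} \right| \cdots \left| \frac{1}{m} \sum_{\alpha_t \le y_t < \beta_t} \omega^{-u_t y_t} \right|.
\end{aligned}$$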
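Both $f$ and $g$ can be simplified further in order to get a good upper bound on
$D_N^{(t)}$. We have

$$\left| \frac{1}{m} \sum_{\alpha \le y < \beta} \omega^{-uy} \right|
= \left| \frac{1}{m}\, \frac{\omega^{-\beta u} - \omega^{-\alpha u}}{\omega^{-u} - 1} \right|
\le \frac{2}{m\,|\omega^{u} - 1|} = \frac{1}{m \sin(\pi u/m)}$$

when $u \ne 0$, and the sum is $\le 1$ when $u = 0$; hence

$$f(u_1, \ldots, u_t) \le r(u_1, \ldots, u_t), \tag{45}$$

where

$$r(u_1, \ldots, u_t) = \prod_{\substack{1 \le k \le t \\ u_k \ne 0}} \frac{1}{m \sin(\pi u_k/m)}. \tag{46}$$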

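Furthermore, when $\langle x_n \rangle$ is generated modulo $m$ by a linear congruential
sequence, we have

$$\begin{aligned}
x_n u_1 + \cdots + x_{n+t-1} u_t &= x_n u_1 + (a x_n + c) u_2 + \cdots + \bigl(a^{t-1} x_n + c(a^{t-2} + \cdots + 1)\bigr) u_t \\
&= (u_1 + a u_2 + \cdots + a^{t-1} u_t)\, x_n + h(u_1, \ldots, u_t)
\end{aligned}$$

where $h(u_1, \ldots, u_t)$ is independent of $n$; hence

$$g(u_1, \ldots, u_t) = \left| \frac{1}{N} \sum_{0 \le n < N} \omega^{q(u_1, \ldots, u_t)\, x_n} \right|, \tag{47}$$

where

$$q(u_1, \ldots, u_t) = u_1 + a u_2 + \cdots + a^{t-1} u_t. \tag{48}$$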
Now here is where the connection to the spectral test comes in: We will show
that the sum $g(u_1, \ldots, u_t)$ is rather small unless $q(u_1, \ldots, u_t) \equiv 0$ (modulo $m$);
in other words, the contributions to (44) arise mainly from the solutions to (15).
Furthermore exercise 27 shows that $r(u_1, \ldots, u_t)$ is rather small when $(u_1, \ldots, u_t)$
is a “large” solution to (15). Hence the discrepancy $D_N^{(t)}$ will be rather small
when (15) has only “large” solutions, namely when the spectral test is passed.
Our remaining task is to quantify these qualitative statements by making careful
calculations.
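The qualitative claim about $g$ can be observed on a toy generator. The sketch below evaluates the definition of $g(u_1, u_2)$ over a full period ($N = m$) of the small sequence with $m = 16$, $a = 5$, $c = 1$, chosen here purely for illustration, and records whether $g$ is essentially 1 exactly when $q(u_1, u_2) = u_1 + a u_2$ is a multiple of $m$ and essentially 0 otherwise. Floating-point complex arithmetic is used, so the comparisons allow a small tolerance.

```python
import cmath


def g(us, xs, m, N):
    """|(1/N) * sum over 0 <= n < N of w^(x_n u_1 + ... + x_{n+t-1} u_t)|, w = exp(2*pi*i/m)."""
    w = cmath.exp(2j * cmath.pi / m)
    total = sum(
        w ** (sum(u * xs[n + i] for i, u in enumerate(us)) % m) for n in range(N)
    )
    return abs(total) / N


m, a, c = 16, 5, 1              # a toy full-period generator (an illustrative choice)
xs, x = [], 0
for _ in range(m + 1):          # one extra term so that x_{n+1} exists for n = m - 1
    xs.append(x)
    x = (a * x + c) % m

agrees = all(
    abs(g((u1, u2), xs, m, m) - (1.0 if (u1 + a * u2) % m == 0 else 0.0)) < 1e-9
    for u1 in range(m)
    for u2 in range(m)
)
print(agrees)
```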

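In the first place, let’s consider the size of $g(u_1, \ldots, u_t)$. When $N = m$,
so that the sum (47) is over an entire period, we have $g(u_1, \ldots, u_t) = 0$ except
when $(u_1, \ldots, u_t)$ satisfies (15), so the discrepancy is bounded above in this case
by the sum of $r(u_1, \ldots, u_t)$ taken over all the nonzero solutions of (15). But
let’s consider also what happens in a sum like (47) when $N$ is less than $m$ and
$q(u_1, \ldots, u_t)$ is not a multiple of $m$. We have

$$\begin{aligned}
\sum_{0 \le n < N} \omega^{x_n} &= \sum_{0 \le n < N} \frac{1}{m} \sum_{0 \le k < m} \omega^{-nk} \sum_{0 \le j < m} \omega^{x_j + jk} \\
&= \sum_{0 \le k < m} \Bigl( \frac{1}{m} \sum_{0 \le n < N} \omega^{-nk} \Bigr) S_{k0},
\end{aligned} \tag{49}$$

where

$$S_{kl} = \sum_{0 \le j < m} \omega^{x_{j+l} + jk}. \tag{50}$$

Now $S_{kl} = \omega^{-lk} S_{k0}$, so $|S_{kl}| = |S_{k0}|$ for all $l$, and we can calculate this common
value by further exponential-summery:

$$\begin{aligned}
|S_{k0}|^2 &= \frac{1}{m} \sum_{0 \le l < m} |S_{kl}|^2 \\
&= \frac{1}{m} \sum_{0 \le l < m}\ \sum_{0 \le j < m} \omega^{x_{j+l} + jk} \sum_{0 \le i < m} \omega^{-x_{i+l} - ik} \\
&= \frac{1}{m} \sum_{0 \le i, j < m} \omega^{(j-i)k} \sum_{0 \le l < m} \omega^{x_{j+l} - x_{i+l}} \\
&= \frac{1}{m} \sum_{0 \le i < m}\ \sum_{i \le j < m+i} \omega^{(j-i)k} \sum_{0 \le l < m} \omega^{(a^{j-i} - 1) x_{i+l} + (a^{j-i} - 1)c/(a-1)}.
\end{aligned}$$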

Let $s$ be minimum such that $a^s \equiv 1$ (modulo $m$), and let

$$s' = (a^s - 1)c/(a - 1) \bmod m.$$

Then $s$ is a divisor of $m$ (see Lemma 3.2.1.2P), and $x_{n+js} \equiv x_n + js'$ (modulo $m$).
The sum on $l$ vanishes unless $j - i$ is a multiple of $s$, so we find that

$$|S_{k0}|^2 = m \sum_{0 \le j < m/s} \omega^{jsk + js'}.$$

We have $s' = q's$ where $q'$ is relatively prime to $m$ (see exercise 3.2.1.2–21), so
it turns out that

$$|S_{k0}| = \begin{cases} 0 & \text{if } k + q' \not\equiv 0 \pmod{m/s}, \\ m/\sqrt{s} & \text{if } k + q' \equiv 0 \pmod{m/s}. \end{cases} \tag{51}$$

Putting this information back into (49), and recalling the derivation of (45),
shows that

$$\left| \sum_{0 \le n < N} \omega^{x_n} \right| \le \frac{m}{\sqrt{s}} \sum_{k} r(k), \tag{52}$$

where the sum is over $0 \le k < m$ such that $k + q' \equiv 0$ (modulo $m/s$). Exercise 25
can now be used to estimate the remaining sum, and we find that

$$\left| \sum_{0 \le n < N} \omega^{x_n} \right| \le \frac{2}{\pi} \sqrt{s}\, \ln s + O\Bigl(\frac{m}{\sqrt{s}}\Bigr). \tag{53}$$

The same upper bound applies also to $\bigl|\sum_{0 \le n < N} \omega^{q x_n}\bigr|$ for any $q \not\equiv 0$ (modulo $m$),
since the effect is to replace $m$ in this derivation by a divisor of $m$. In fact, the
upper bound gets even smaller when $q$ has a factor in common with $m$, since $s$
and $m/\sqrt{s}$ generally become smaller. (See exercise 26.)
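Equation (51) can be checked numerically on a toy example. The sketch below uses the small full-period generator $m = 16$, $a = 5$, $c = 1$ (an illustrative choice, not taken from the text), computes $s$, $s'$, and $q' = s'/s$ as defined above, and compares $|S_{k0}|$ from definition (50) with the two-valued prediction of (51). The quantity $(a^s - 1)c/(a - 1)$ is evaluated as $c(1 + a + \cdots + a^{s-1})$ to avoid a division.

```python
import cmath
from math import sqrt

m, a, c = 16, 5, 1
xs, x = [], 0
for _ in range(m):
    xs.append(x)
    x = (a * x + c) % m

# s = least positive exponent with a^s = 1 (mod m); s' = (a^s - 1)c/(a - 1) mod m.
s = next(e for e in range(1, m + 1) if pow(a, e, m) == 1)
s_prime = c * sum(a**i for i in range(s)) % m
q_prime = s_prime // s

w = cmath.exp(2j * cmath.pi / m)
S = [abs(sum(w ** ((xs[j] + j * k) % m) for j in range(m))) for k in range(m)]
predicted = [m / sqrt(s) if (k + q_prime) % (m // s) == 0 else 0.0 for k in range(m)]

print(s, s_prime, q_prime)
print([round(v, 6) for v in S])
```

For this generator only every fourth value of $k$ gives a nonzero sum, and each nonzero sum has magnitude $m/\sqrt{s} = 8$.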

We have now proved that the $g(u_1, \ldots, u_t)$ part of our upper bound (44) on
the discrepancy is small, if $N$ is large enough and if $(u_1, \ldots, u_t)$ does not satisfy
the spectral test congruence (15). Exercise 27 proves that the $f(u_1, \ldots, u_t)$
part of our upper bound is small, when summed over all the nonzero vectors
$(u_1, \ldots, u_t)$ satisfying (15), provided that all such vectors are far away from
$(0, \ldots, 0)$. Putting these results together leads to the following theorem of
Niederreiter:


Theorem N. Let $\langle X_n \rangle$ be a linear congruential sequence $(X_0, a, c, m)$ of period
length $m > 1$, and let $s$ be the least positive integer such that $a^s \equiv 1$ (modulo $m$).
Then the $t$-dimensional discrepancy $D_N^{(t)}$ corresponding to the first $N$ values
of $\langle X_n \rangle$, as defined in (42), satisfies

$$D_N^{(t)} = O\Bigl( \frac{\sqrt{s}\, \log s\, (\log m)^t}{N} \Bigr) + O\Bigl( \frac{m (\log m)^t}{N \sqrt{s}} \Bigr) + O\bigl( (\log m)^t\, r_{\max} \bigr); \tag{54}$$

$$D_m^{(t)} = O\bigl( (\log m)^t\, r_{\max} \bigr). \tag{55}$$

Here $r_{\max}$ is the maximum value of the quantity $r(u_1, \ldots, u_t)$ defined in (46),
taken over all nonzero integer vectors $(u_1, \ldots, u_t)$ satisfying (15).


Proof. The first two $O$-terms in (54) come from vectors $(u_1, \ldots, u_t)$ in (44)
that do not satisfy (15), since exercise 25 proves that $f(u_1, \ldots, u_t)$ summed over
all $(u_1, \ldots, u_t)$ is $O\bigl(((2/\pi) \ln m)^t\bigr)$ and exercise 26 bounds each $g(u_1, \ldots, u_t)$.
(These terms are missing from (55) since $g(u_1, \ldots, u_t) = 0$ in that case.) The
remaining $O$-term in (54) and (55) comes from nonzero vectors $(u_1, \ldots, u_t)$ that
do satisfy (15), using the bound derived in exercise 27. (By examining this
proof carefully, we could replace each $O$ in these formulas by an explicit function
of $t$.) ∎


Eq. (55) relates to the serial test in $t$ dimensions over the entire period,
while Eq. (54) gives us useful information about the distribution of the first $N$
generated values when $N$ is less than $m$, provided that $N$ is not too small.
Notice that (54) will guarantee low discrepancy only when $s$ is sufficiently large,
otherwise the $m/\sqrt{s}$ term will dominate. If $m = p_1^{e_1} \ldots p_r^{e_r}$ and $\gcd(a - 1, m) =
p_1^{f_1} \ldots p_r^{f_r}$, then $s$ equals $p_1^{e_1 - f_1} \ldots p_r^{e_r - f_r}$ by Lemma 3.2.1.2P; thus, the largest
values of $s$ correspond to high potency. In the common case $m = 2^e$ and $a \equiv 5$
(modulo 8), we have $s = \frac{1}{4} m$, so $D_N^{(t)}$ is $O\bigl(\sqrt{m}\, (\log m)^{t+1}/N\bigr) + O\bigl((\log m)^t\, r_{\max}\bigr)$.
It is not difficult to prove that

$$r_{\max} \le \frac{1}{\sqrt{8}\, \nu_t} \tag{56}$$

(see exercise 29). Therefore Eq. (54) says in particular that the discrepancy will
be low in $t$ dimensions if the spectral test is passed and if $N$ is somewhat larger
than $\sqrt{m}\, (\log m)^{t+1}$.

In a sense Theorem N is almost too strong, for the result in exercise 30 shows
that linear congruential sequences like those in lines 8 and 13 of Table 1 have a
discrepancy of order $(\log m)^2/m$ in two dimensions. The discrepancy in this case
is extremely small in spite of the fact that there are parallelogram-shaped regions
of area $\approx 1/\sqrt{m}$ containing no points $(U_n, U_{n+1})$. The fact that discrepancy can
change so drastically when the points are rotated warns us that the serial test
may not be as meaningful a measure of randomness as the rotation-invariant
spectral test.

G. Historical remarks. In 1959, while deriving upper bounds for the error
in the evaluation of $t$-dimensional integrals by the Monte Carlo method, N. M.
Korobov devised a way to rate the multiplier of a linear congruential sequence.
His rather complicated formula is related to the spectral test, since it is strongly
influenced by “small” solutions to (15); but it is not quite the same. Korobov’s
test has been the subject of an extensive literature, surveyed by Kuipers and
Niederreiter in Uniform Distribution of Sequences (New York: Wiley, 1974), §2.5.

The spectral test was originally formulated by R. R. Coveyou and R. D.
MacPherson [JACM 14 (1967), 100–119], who introduced it in an interesting
indirect way. Instead of working with the grid structure of successive points,
they considered random number generators as sources of $t$-dimensional “waves.”
The numbers $\sqrt{x_1^2 + \cdots + x_t^2}$ such that $x_1 + \cdots + a^{t-1} x_t \equiv 0$ (modulo $m$) in
their original treatment were the wave “frequencies,” or points in the “spectrum”
defined by the random number generator, with low-frequency waves being the
most damaging to randomness; hence the name spectral test. Coveyou and
MacPherson introduced a procedure analogous to Algorithm S for performing
their test, based on the principle of Lemma A. However, their original procedure
(which used matrices $UU^T$ and $VV^T$ instead of $U$ and $V$) dealt with extremely
large numbers; the idea of working directly with $U$ and $V$ was independently
suggested by F. Janssens and by U. Dieter. [See Math. Comp. 29 (1975), 827–833.]

Several other authors pointed out that the spectral test could be understood 
in far more concrete terms; by introducing the study of the grid and lattice struc- 
tures corresponding to linear congruential sequences, the fundamental limitations 
on randomness became graphically clear. See G. Marsaglia, Proc. Nat. Acad. Sci.
61 (1968), 25-28; W. W. Wood, J. Chem. Phys. 48 (1968), 427; R. R. Coveyou,
Studies in Applied Math. 3 (Philadelphia: SIAM, 1969), 70-111; W. A. Beyer, 
R. B. Roof, and D. Williamson, Math. Comp. 25 (1971), 345-360; G. Marsaglia 
and W. A. Beyer, Applications of Number Theory to Numerical Analysis, edited 
by S. K. Zaremba (New York: Academic Press, 1972), 249-285, 361-370. 

R. G. Stoneham showed, by using estimates of exponential sums, that $p^{1/2+\epsilon}$
or more elements of the sequence $a^n X_0 \bmod p$ have asymptotically small
discrepancy, when $a$ is a primitive root modulo the prime $p$ [Acta Arithmetica 22
(1973), 371–389]. This work was extended as explained above in a number of
papers by Harald Niederreiter [Math. Comp. 28 (1974), 1117–1132; 30 (1976),
571–597; Advances in Math. 26 (1977), 99–181; Bull. Amer. Math. Soc. 84
(1978), 957–1041]. See also Niederreiter’s book Random Number Generation
and Quasi-Monte Carlo Methods (Philadelphia: SIAM, 1992).


EXERCISES 


1. [M10] To what does the spectral test reduce in one dimension? (In other words, 
what happens when t = 1?) 


2. [HM20] Let $V_1, \ldots, V_t$ be linearly independent vectors in $t$-space, let $L_0$ be the
lattice of points defined by (10), and let $U_1, \ldots, U_t$ be defined by (19). Prove that the
maximum distance between $(t-1)$-dimensional hyperplanes, over all families of parallel
hyperplanes that cover $L_0$, is $1/\min\{f(x_1, \ldots, x_t)^{1/2} \mid (x_1, \ldots, x_t) \ne (0, \ldots, 0)\}$, where
$f$ is defined in (17).

3. [M24] Determine $\nu_3$ and $\nu_4$ for all linear congruential generators of potency 2 and
period length $m$.


4. [M23] Let $u_{11}$, $u_{12}$, $u_{21}$, $u_{22}$ be elements of a $2 \times 2$ integer matrix such that
$u_{11} + a u_{12} \equiv u_{21} + a u_{22} \equiv 0$ (modulo $m$) and $u_{11} u_{22} - u_{21} u_{12} = m$.

a) Prove that all integer solutions $(y_1, y_2)$ to the congruence $y_1 + a y_2 \equiv 0$ (modulo $m$)
have the form $(y_1, y_2) = (x_1 u_{11} + x_2 u_{21},\ x_1 u_{12} + x_2 u_{22})$ for integer $x_1$, $x_2$.

b) If, in addition, $2\,|u_{11} u_{21} + u_{12} u_{22}| \le u_{11}^2 + u_{12}^2 \le u_{21}^2 + u_{22}^2$, prove that $(y_1, y_2) =
(u_{11}, u_{12})$ minimizes $y_1^2 + y_2^2$ over all nonzero solutions to the congruence.

5. [M30] Prove that steps S1 through S3 of Algorithm S correctly perform the
spectral test in two dimensions. [Hint: See exercise 4, and prove that $(h' + h)^2 + (p' + p)^2 \ge
h^2 + p^2$ at the beginning of step S2.]

6. [M30] Let $a_0, a_1, \ldots, a_{t-1}$ be the partial quotients of $a/m$ as defined in Section
3.3.3, and let $A = \max_{0 \le j < t} a_j$. Prove that $\mu_2 > 2\pi/(A + 1 + 1/A)$.

7. [HM22] Prove that questions (a) and (b) following Eq. (23) have the same solution
for real values of $q_1, \ldots, q_{j-1}, q_{j+1}, \ldots, q_t$ (see (24) and (26)).


8. [M18] Line 10 of Table 1 has a very low value of $\mu_2$, yet $\mu_3$ is quite satisfactory.
What is the highest possible value of $\mu_3$ when $\mu_2 = 10^{-6}$ and $m = 10^{10}$?


9. [HM32] (C. Hermite, 1846.) Let $f(x_1, \ldots, x_t)$ be a positive definite quadratic
form, defined by the matrix $U$ as in (17), and let $\theta$ be the minimum value of $f$ at
nonzero integer points. Prove that $\theta \le (\frac{4}{3})^{(t-1)/2}\, |\det U|^{2/t}$. [Hints: If $W$ is any integer
matrix of determinant 1, the matrix $WU$ defines a form equivalent to $f$; and if $S$ is
any orthogonal matrix (that is, if $S^{-1} = S^T$), the matrix $US$ defines a form identically
equal to $f$. Show that there is an equivalent form $g$ whose minimum $\theta$ occurs at
$(1, 0, \ldots, 0)$. Then prove the general result by induction on $t$, writing $g(x_1, \ldots, x_t) =
\theta(x_1 + \beta_2 x_2 + \cdots + \beta_t x_t)^2 + h(x_2, \ldots, x_t)$ where $h$ is a positive definite quadratic form
in $t - 1$ variables.]


10. [M28] Let $y_1$ and $y_2$ be relatively prime integers such that $y_1 + a y_2 \equiv 0$ (modulo $m$)
and $y_1^2 + y_2^2 < \sqrt{4/3}\, m$. Show that there exist integers $u_1$ and $u_2$ such that $u_1 + a u_2 \equiv 0$
(modulo $m$), $u_1 y_2 - u_2 y_1 = m$, $2\,|u_1 y_1 + u_2 y_2| \le \min(u_1^2 + u_2^2,\ y_1^2 + y_2^2)$, and
$(u_1^2 + u_2^2)(y_1^2 + y_2^2) \ge m^2$. (Hence $\nu_2^2 = \min(u_1^2 + u_2^2,\ y_1^2 + y_2^2)$ by exercise 4.)

▶ 11. [HM30] (Alan G. Waterman, 1974.) Invent a reasonably efficient procedure that
computes multipliers $a \equiv 1$ (modulo 4) for which there exists a relatively prime solution
to the congruence $y_1 + a y_2 \equiv 0$ (modulo $m$) with $y_1^2 + y_2^2 = \sqrt{4/3}\, m - \epsilon$, where $\epsilon > 0$ is
as small as possible, given $m = 2^e$. (By exercise 10, this choice of $a$ will guarantee that
$\nu_2^2 \ge m^2/(y_1^2 + y_2^2) > \sqrt{3/4}\, m$, and there is a chance that $\nu_2^2$ will be near its optimum
value $\sqrt{4/3}\, m$. In practice we will compute several such multipliers having small $\epsilon$,
choosing the one with best spectral values $\nu_2, \nu_3, \ldots$.)


12. [HM23] Prove, without geometrical handwaving, that any solution to question (b) 
following Eq. (23) must also satisfy the set of equations (26). 


13. [HM22] Lemma A uses the fact that U is nonsingular to prove that a positive 
definite quadratic form attains a definite, nonzero minimum value at nonzero integer 
points. Show that this hypothesis is necessary, by exhibiting a quadratic form (19) 
whose matrix of coefficients is singular, and for which the values of $f(x_1,\ldots,x_t)$ get
arbitrarily near zero (but never reach it) at nonzero integer points $(x_1,\ldots,x_t)$.


14. [24] Perform Algorithm S by hand, for m = 100, a = 41, T = 3. 


> 15. [M20] Let $U$ be an integer vector satisfying (15). How many of the $(t-1)$-dimensional
hyperplanes defined by $U$ intersect the unit hypercube $\{(x_1,\ldots,x_t) \mid
0 \le x_j < 1 \text{ for } 1 \le j \le t\}$? (This is approximately the number of hyperplanes in
the family that will suffice to cover $L_0$.)


16. [M30] (U. Dieter.) Show how to modify Algorithm S in order to calculate the 
minimum number $N_t$ of parallel hyperplanes intersecting the unit hypercube as in
exercise 15, over all U satisfying (15). [Hint: What are appropriate analogs to positive 
definite quadratic forms and to Lemma A?] 


17. [20] Modify Algorithm S so that, in addition to computing the quantities $\nu_t$, it
outputs all integer vectors $(u_1,\ldots,u_t)$ satisfying (15) such that $u_1^2 + \cdots + u_t^2 = \nu_t^2$, for
$2 \le t \le T$.


18. [M30] This exercise is about the worst case of Algorithm S. 

a) By considering “combinatorial matrices,” whose elements have the form $y + x\delta_{ij}$
(see exercise 1.2.3-39), find 3 x 3 matrices of integers $U$ and $V$ satisfying (29) such
that the transformation of step S5 does nothing for any $j$, but the corresponding
values of $z_k$ in (31) are so huge that exhaustive search is out of the question. (The
matrix $U$ need not satisfy (28); we are interested here in arbitrary positive definite
quadratic forms of determinant $m$.)

b) Although transformation (23) is of no use for the matrices constructed in (a), find 
another transformation that does produce a substantial reduction. 


> 19. [HM25] Suppose step S5 were changed slightly, so that a transformation with
$q = 1$ would be performed when $2V_i\cdot V_j = V_j\cdot V_j$. (Thus, $q = \lfloor (V_i\cdot V_j / V_j\cdot V_j) + \frac12 \rfloor$
whenever $i \ne j$.) Would it be possible for Algorithm S to get into an infinite loop?


20. [M23] Discuss how to carry out an appropriate spectral test for linear congruential
sequences having $c = 0$, $X_0$ odd, $m = 2^e$, $a \bmod 8 = 3$ or 5. (See exercise 3.2.1.2-9.)


21. [M20] (R. W. Gosper.) A certain application uses random numbers in batches of
four, but “throws away” the second of each set. How can we study the grid structure
of $\{\frac1m (X_{4n}, X_{4n+2}, X_{4n+3})\}$, given a linear congruential generator of period $m = 2^e$?


22. [M46] What is the best upper bound on $\mu_3$, given that $\mu_2$ is very near its
maximum value $\sqrt{4/3}\,\pi$? What is the best upper bound on $\mu_2$, given that $\mu_3$ is very
near its maximum value $\frac43\pi\sqrt2$?


23. [M46] Let $U_i$, $V_j$ be vectors of real numbers with $U_i\cdot V_j = \delta_{ij}$ for $1 \le i,j \le t$,
and such that $U_i\cdot U_i = 1$, $2|U_i\cdot U_j| \le 1$, $2|V_i\cdot V_j| \le V_j\cdot V_j$ for $i \ne j$. How large
can $V_1\cdot V_1$ be? (This question relates to the bounds in step S7, if both (23) and the
transformation of exercise 18(b) fail to make any reductions. The maximum value
known to be achievable is $(t+2)/3$, which occurs when $U_1 = I_1$, $U_j = \frac12 I_1 + \frac12\sqrt3\,I_j$,
$V_1 = I_1 - (I_2 + \cdots + I_t)/\sqrt3$, $V_j = 2I_j/\sqrt3$, for $2 \le j \le t$, where $(I_1,\ldots,I_t)$ is the
identity matrix; this construction is due to B. V. Alexeev.)


24. [M28] Generalize the spectral test to second-order sequences of the form $X_n =
(aX_{n-1} + bX_{n-2}) \bmod p$, having period length $p^2 - 1$. (See Eq. 3.2.2-(8).) How should
Algorithm S be modified?

25. [HM24] Let $d$ be a divisor of $m$ and let $0 \le q < d$. Prove that $\sum r(k)$, summed
over all $0 \le k < m$ such that $k \bmod d = q$, is at most $(2/d\pi)\ln(m/d) + O(1)$. (Here
$r(k)$ is defined in Eq. (46) when $t = 1$.)


26. [M22] Explain why the derivation of (53) leads to a similar bound on 


for 0<q<™m. 


27. [HM39] (E. Hlawka, H. Niederreiter.) Let $r(u_1,\ldots,u_t)$ be the function defined
in (46). Prove that $\sum r(u_1,\ldots,u_t)$, summed over all $0 \le u_1,\ldots,u_t < m$ such that
$(u_1,\ldots,u_t) \ne (0,\ldots,0)$ and (15) holds, is at most $2\bigl((\pi + 2\pi\lg m)^t\, r_{\max}\bigr)$, where $r_{\max}$
is the maximum term $r(u_1,\ldots,u_t)$ in the sum.


28. [M28] (H. Niederreiter.) Find an analog of Theorem N for the case $m$ = prime,
$c = 0$, $a$ = primitive root modulo $m$, $X_0 \not\equiv 0$ (modulo $m$). [Hint: Your exponential
sums should involve $\zeta = e^{2\pi i/(m-1)}$ as well as $\omega$.] Prove that in this case the “average”
primitive root has discrepancy $D^{(t)}_{m-1} = O\bigl(t(\log m)^t/\varphi(m-1)\bigr)$, hence good primitive
roots exist for all $m$.


29. [HM22] Prove that the quantity $r_{\max}$ of exercise 27 is never larger than $1/(\sqrt8\,\nu_t)$.


30. [M33] (S. K. Zaremba.) Prove that $r_{\max} = O(\max(a_1,\ldots,a_s)/m)$ in two dimensions,
where $a_1$, ..., $a_s$ are the partial quotients obtained when Euclid’s algorithm
is applied to $m$ and $a$. [Hint: We have $a/m = /\!/a_1,\ldots,a_s/\!/$, in the notation of
Section 4.5.3; apply exercise 4.5.3-42.]


31. [HM48] (I. Borosh and H. Niederreiter.) Prove that for all sufficiently large $m$
there exists a number $a$ relatively prime to $m$ such that all partial quotients of $a/m$
are $\le 3$. Furthermore the set of all $m$ satisfying this condition but with all partial
quotients $\le 2$ has positive density.
> 32. [M21] Let $m_1 = 2^{31} - 1$ and $m_2 = 2^{31} - 249$ be the moduli of generator (38).
a) Show that if $U_n = (X_n/m_1 - Y_n/m_2) \bmod 1$, we have $U_n \approx Z_n/m_1$.
b) Let $W_0 = (X_0m_2 - Y_0m_1) \bmod m$ and $W_{n+1} = aW_n \bmod m$, where $a$ and $m$ have
the values stated in the text following (38). Prove that there is a simple relation
between $W_n$ and $U_n$.


In the next edition of this book, I plan to introduce a new Section 3.3.5,
entitled “The $L^3$ Algorithm.” It will be a digression from the general topic of
Random Numbers, but it will continue the discussion of lattice basis reduction in 
Section 3.3.4. Its main topic will be the now-classic algorithm of A. K. Lenstra, 
H. W. Lenstra, Jr., and L. Lovász [Math. Annalen 261 (1982), 515-534] for 
finding a near-optimum set of basis vectors, and improvements to that algorithm 
made subsequently by other researchers. Examples of the latter can be found 
in the following papers and their bibliographies: M. Seysen, Combinatorica 13 
(1993), 363-375; C. P. Schnorr and H. H. Hörner, Lecture Notes in Comp. Sci. 
921 (1995), 1-12. 
3.4. OTHER TYPES OF RANDOM QUANTITIES


WE HAVE NOW SEEN how to make a computer generate a sequence of numbers 
$U_0$, $U_1$, $U_2$, ... that behaves as if each number were independently selected
at random between zero and one with the uniform distribution. Applications of 
random numbers often call for other kinds of distributions, however; for example, 
if we want to make a random choice from among k alternatives, we want a 
random integer between 1 and k. If some simulation process calls for a random 
waiting time between occurrences of independent events, a random number with 
the exponential distribution is desired. Sometimes we don’t even want random 
numbers — we want a random permutation (a random arrangement of n objects) 
or a random combination (a random choice of k objects from a collection of n). 

In principle, any of these other random quantities can be obtained from the 
uniform deviates $U_0$, $U_1$, $U_2$, ...; people have devised a number of important
“random tricks” for the efficient transformation of uniform deviates. A study of 
these techniques also gives us insight into the proper use of random numbers in 
any Monte Carlo application. 

It is conceivable that someday somebody will invent a random number 
generator that produces one of these other random quantities directly, instead of 
getting it indirectly via the uniform distribution. But no direct methods have as 
yet proved to be practical, except for the “random bit” generator described in 
Section 3.2.2. (See also exercise 3.4.1-31, where the uniform distribution is used 
primarily for initialization, after which the method is almost entirely direct.) 

The discussion in the following section assumes the existence of a random 
sequence of uniformly distributed real numbers between zero and one. A new 
uniform deviate U is generated whenever we need it. These numbers are usually 
represented in a computer word with the radix point assumed at the left. 


3.4.1. Numerical Distributions 


This section summarizes the best techniques known for producing numbers from 
various important distributions. Many of the methods were originally suggested 
by John von Neumann in the early 1950s, and they have gradually been improved 
upon by other people, notably George Marsaglia, J. H. Ahrens, and U. Dieter. 


A. Random choices from a finite set. The simplest and most common type 
of distribution required in practice is a random integer. An integer between 0 
and 7 can be extracted from three bits of U on a binary computer; in such a 
case, these bits should be extracted from the most significant (left-hand) part 
of the computer word, since the least significant bits produced by many random 
number generators are not sufficiently random. (See the discussion in Section
3.2.1.1.)

In general, to get a random integer X between 0 and k − 1, we can multiply
by k, and let $X = \lfloor kU \rfloor$. On MIX, we would write

    LDA  U
    MUL  K                                  (1)

and after these two instructions have been executed the desired integer will
appear in register A. If a random integer between 1 and k is desired, we add one
to this result. (The instruction ‘INCA 1’ would follow (1).)

This method gives each integer with nearly equal probability. There is a 
slight error because the computer word size is finite (see exercise 2); but the 
error is quite negligible if k is small, for example if k/m < 1/10000. 
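
The size of this error can be checked with a small sketch. It assumes, purely for illustration, that U takes the m equally likely values 0/m, 1/m, ..., (m − 1)/m; the function names are hypothetical and do not come from the text.

```python
def random_index(k, u):
    """Return floor(k*u) for a uniform deviate u with 0 <= u < 1."""
    return int(k * u)


def exact_counts(k, m):
    """Count how often each value floor(k*U) arises when U = j/m, 0 <= j < m."""
    counts = [0] * k
    for j in range(m):
        counts[(k * j) // m] += 1  # integer arithmetic avoids rounding error
    return counts


# With k = 3 and a toy word size m = 16 the three counts cannot all be equal,
# but they differ by at most one.
print(exact_counts(3, 16))
```

Each count is either ⌊m/k⌋ or ⌈m/k⌉, so the relative error in any one probability is at most about k/m.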

In a more general situation we might want to give different weights to 
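different integers. Suppose that the value $X = x_1$ is to be obtained with
probability $p_1$, and $X = x_2$ with probability $p_2$, ..., $X = x_k$ with probability $p_k$.
We can generate a uniform number U and let

$$X = \begin{cases}
x_1, & \text{if } 0 \le U < p_1;\\
x_2, & \text{if } p_1 \le U < p_1 + p_2;\\
\ \vdots\\
x_k, & \text{if } p_1 + p_2 + \cdots + p_{k-1} \le U < 1.
\end{cases} \tag{2}$$

(Note that $p_1 + p_2 + \cdots + p_k = 1$.)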
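
Rule (2) can be sketched as a search in the table of partial sums. The binary search used here is only one convenient way to do the comparisons, not necessarily the “best possible” one; the function name and the example probabilities are assumptions made for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def choose(values, probs, u):
    """Return x_i such that p_1+...+p_{i-1} <= u < p_1+...+p_i, as in rule (2)."""
    partial_sums = list(accumulate(probs))
    i = bisect_right(partial_sums, u)  # number of partial sums that are <= u
    return values[min(i, len(values) - 1)]  # guard against a rounded final sum < 1


values = ["x1", "x2", "x3"]
probs = [0.25, 0.5, 0.25]  # exactly representable, so the boundaries are exact
print([choose(values, probs, u) for u in (0.0, 0.25, 0.74, 0.75)])
```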

There is a “best possible” way to do the comparisons of U against various
values of $p_1 + p_2 + \cdots + p_s$, as implied in (2); this situation is discussed in
Section 2.3.4.5. Special cases can be handled by more efficient methods; for
example, to obtain one of the eleven values 2, 3, ..., 12 with the respective “dice”
probabilities 1/36, 2/36, ..., 6/36, ..., 2/36, 1/36, we could compute two independent
random integers between 1 and 6 and add them together.

However, there is actually a faster way to select $x_1$, ..., $x_k$ with arbitrarily
given probabilities, based on an ingenious approach introduced by A. J. Walker
[Electronics Letters 10, 8 (1974), 127-128; ACM Trans. Math. Software 3 (1977),
253-256]. Suppose we form kU and consider the integer part $K = \lfloor kU \rfloor$ and
fraction part $V = (kU) \bmod 1$ separately; for example, after the code (1) we will
have K in register A and V in register X. Then we can always obtain the desired
distribution by doing the operations

    if $V < P_K$ then $X \leftarrow x_{K+1}$ otherwise $X \leftarrow Y_K$,          (3)

for some appropriate tables $(P_0, \ldots, P_{k-1})$ and $(Y_0, \ldots, Y_{k-1})$. Exercise 7 shows
how such tables can be computed in general. Walker’s method is sometimes
called the method of “aliases.”

On a binary computer it is usually helpful to assume that k is a power of 2, 
so that multiplication can be replaced by shifting; this can be done without loss 
of generality by introducing additional x’s that occur with probability zero. For 
example, let’s consider dice again; suppose we want X = j to occur with the 
following 16 probabilities: 


    j   =  0    1    2     3     4     5     6     7     8     9    10    11    12    13   14   15
    p_j =  0    0   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36   0    0    0

We can do this using (3), if k = 16 and $x_{j+1} = j$ for $0 \le j < 16$, and if the P
and Y tables are set up as follows:

    j   =  0    1    2     3     4     5     6     7     8     9    10    11    12    13   14   15
    P_j =  0    0   4/9   8/9    1    7/9    1     1     1    7/9   7/9   8/9   4/9    0    0    0
    Y_j =  5    9    7     4     *     6     *     *     *     8     4     7    10     6    7    8

(When $P_j = 1$, $Y_j$ is not used.) For example, the value 7 occurs with probability
$\frac{1}{16}\bigl((1 - P_2) + P_7 + (1 - P_{11}) + (1 - P_{14})\bigr) = \frac{6}{36}$ as required. It is a peculiar way
to throw dice, but the results are indistinguishable from the real thing.
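
The dice example can be checked mechanically. The sketch below transcribes the P and Y tables into exact fractions and applies rule (3) with $x_{K+1} = K$; the function names are hypothetical, and the code only uses the tables, without showing how to construct them (that is the subject of exercise 7).

```python
from fractions import Fraction

# P_j in ninths and Y_j, for j = 0..15 (None where P_j = 1 and Y_j is unused).
P = [Fraction(n, 9) for n in (0, 0, 4, 8, 9, 7, 9, 9, 9, 7, 7, 8, 4, 0, 0, 0)]
Y = [5, 9, 7, 4, None, 6, None, None, None, 8, 4, 7, 10, 6, 7, 8]


def alias_sample(u, P, Y):
    """Apply rule (3) to a uniform deviate 0 <= u < 1, with x_{K+1} = K."""
    k = len(P)
    K = int(k * u)      # integer part of kU
    V = k * u - K       # fraction part of kU
    return K if V < P[K] else Y[K]


def exact_distribution(P, Y):
    """Return the exact probability of every outcome of rule (3)."""
    k = len(P)
    dist = {}
    for K in range(k):
        dist[K] = dist.get(K, 0) + P[K] / k
        if P[K] < 1:
            dist[Y[K]] = dist.get(Y[K], 0) + (1 - P[K]) / k
    return {x: p for x, p in dist.items() if p > 0}


dice = {total: Fraction(6 - abs(7 - total), 36) for total in range(2, 13)}
print(exact_distribution(P, Y) == dice)
```

Only one uniform deviate, one table lookup, and one comparison are needed per sample, regardless of k.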

The probabilities $p_j$ can be represented implicitly by nonnegative weights
$w_1$, $w_2$, ..., $w_k$; if we denote the sum of the weights by W, then $p_j = w_j/W$.
In many applications the individual weights vary dynamically. Matias, Vitter,
and Ni [SODA 4 (1993), 361-370] have shown how to update a weight and
generate X in constant expected time.


B. General methods for continuous distributions. The most general real-valued
distribution can be expressed in terms of its “distribution function” F(x),
which specifies the probability that a random quantity X will not exceed x:

$$F(x) = \Pr(X \le x). \tag{4}$$

This function always increases monotonically from zero to one; that is,

$$F(x_1) \le F(x_2) \quad\text{if } x_1 \le x_2; \qquad F(-\infty) = 0, \quad F(+\infty) = 1. \tag{5}$$


Examples of distribution functions are given in Section 3.3.1, Fig. 3. If F(x)
is continuous and strictly increasing (so that $F(x_1) < F(x_2)$ when $x_1 < x_2$),
it takes on all values between zero and one, and there is an inverse function
$F^{[-1]}(y)$ such that, for $0 < y < 1$,

$$y = F(x) \quad\text{if and only if}\quad x = F^{[-1]}(y). \tag{6}$$

In general, when F(x) is continuous and strictly increasing, we can compute a
random quantity X with distribution F(x) by setting

$$X = F^{[-1]}(U), \tag{7}$$

where U is uniform. This works because the probability that $X \le x$ is the
probability that $F^{[-1]}(U) \le x$, namely the probability that $U \le F(x)$, namely F(x).
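
As a concrete instance of (7), here is a minimal sketch for the exponential distribution $F(x) = 1 - e^{-x}$, $x \ge 0$, whose inverse is $F^{[-1]}(y) = -\ln(1-y)$. The choice of this distribution and the function names are assumptions made for illustration.

```python
import math
import random


def exponential_cdf(x):
    """F(x) = 1 - e^{-x} for x >= 0 (and 0 for negative x)."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0


def exponential_inverse(y):
    """The inverse function: x such that F(x) = y, for 0 <= y < 1."""
    return -math.log(1.0 - y)


def exponential_deviate(rng=random):
    """Rule (7): apply the inverse distribution function to a uniform deviate."""
    return exponential_inverse(rng.random())  # random() < 1, so the log is finite


rng = random.Random(2024)
sample = [exponential_deviate(rng) for _ in range(100_000)]
print(sum(sample) / len(sample))  # should be close to the mean, which is 1
```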

The problem now reduces to one of numerical analysis, namely to find good
methods for evaluating $F^{[-1]}(U)$ to the desired accuracy. Numerical analysis
lies outside the scope of this seminumerical book; yet a number of important 
shortcuts are available to speed up the general approach of (7), and we will 
consider them here. 

In the first place, if $X_1$ is a random variable having the distribution $F_1(x)$
and if $X_2$ is an independent random variable with the distribution $F_2(x)$, then

$$\begin{aligned}
\max(X_1, X_2) &\text{ has the distribution } F_1(x)F_2(x),\\
\min(X_1, X_2) &\text{ has the distribution } F_1(x) + F_2(x) - F_1(x)F_2(x).
\end{aligned} \tag{8}$$

(See exercise 4.) For example, a uniform deviate U has the distribution $F(x) = x$,
for $0 \le x \le 1$; if $U_1$, $U_2$, ..., $U_t$ are independent uniform deviates, then
$\max(U_1, U_2, \ldots, U_t)$ has the distribution function $F(x) = x^t$, for $0 \le x \le 1$.
This formula is the basis of the “maximum-of-t test” given in Section 3.3.2; the
inverse function is $F^{[-1]}(y) = \sqrt[t]{y}$. In the special case t = 2, we see therefore
that the two formulas

$$X = \sqrt{U} \qquad\text{and}\qquad X = \max(U_1, U_2) \tag{9}$$

will give equivalent distributions to the random variable X, although this is not
obvious at first glance. We need not take the square root of a uniform deviate.
The number of tricks like this is endless: Any algorithm that employs random 
numbers as input will give a random quantity with some distribution as output. 
The problem is to find general methods for constructing the algorithm, given the 
distribution function of the output. Instead of discussing such methods in purely 
abstract terms, we shall study how they can be applied in important cases. 
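
The equivalence of the two formulas in (9) is easy to test empirically. The following sketch compares both against the distribution function $F(x) = x^2$; the seed, sample size, and names are arbitrary choices for illustration.

```python
import math
import random


def sqrt_of_uniform(rng):
    """First formula of (9): the square root of one uniform deviate."""
    return math.sqrt(rng.random())


def max_of_two(rng):
    """Second formula of (9): the larger of two uniform deviates."""
    return max(rng.random(), rng.random())


def empirical_cdf(sample, x):
    """Fraction of the sample that does not exceed x."""
    return sum(1 for s in sample if s <= x) / len(sample)


rng = random.Random(12345)
n = 100_000
a = [sqrt_of_uniform(rng) for _ in range(n)]
b = [max_of_two(rng) for _ in range(n)]
for x in (0.25, 0.5, 0.75):
    # Both empirical values should be near x squared.
    print(x, x * x, empirical_cdf(a, x), empirical_cdf(b, x))
```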


C. The normal distribution. Perhaps the most important nonuniform, continuous
distribution is the normal distribution with mean zero and standard
deviation one:

$$F(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt. \tag{10}$$

The significance of this distribution was indicated in Section 1.2.10. In this case
the inverse function $F^{[-1]}$ is not especially easy to compute; but we shall see
that several other techniques are available.


1) The polar method, due to G. E. P. Box, M. E. Muller, and G. Marsaglia. 
(See Annals Math. Stat. 29 (1958), 610-611; and Boeing Scientific Res. Lab. 
report D1-82-0203 (1962).) 


Algorithm P (Polar method for normal deviates). This algorithm calculates
two independent normally distributed variables, $X_1$ and $X_2$.

P1. [Get uniform variables.] Generate two independent random variables, $U_1$
    and $U_2$, uniformly distributed between zero and one. Set $V_1 \leftarrow 2U_1 - 1$,
    $V_2 \leftarrow 2U_2 - 1$. (Now $V_1$ and $V_2$ are uniformly distributed between −1 and
    +1. On most computers it will be preferable to have $V_1$ and $V_2$ represented
    in floating point form.)

P2. [Compute S.] Set $S \leftarrow V_1^2 + V_2^2$.

P3. [Is S ≥ 1?] If $S \ge 1$, return to step P1. (Steps P1 through P3 are executed
    1.27 times on the average, with a standard deviation of 0.59; see exercise 6.)

P4. [Compute $X_1$, $X_2$.] If S = 0, set $X_1 \leftarrow X_2 \leftarrow 0$; otherwise set

$$X_1 \leftarrow V_1 \sqrt{\frac{-2\ln S}{S}}, \qquad X_2 \leftarrow V_2 \sqrt{\frac{-2\ln S}{S}}. \tag{11}$$

    These are the normally distributed variables desired. ▮
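
Steps P1 through P4 translate almost line for line into a short routine. This is a sketch in floating point arithmetic, assuming Python's `random.Random` as the source of uniform deviates; the function name is hypothetical.

```python
import math
import random


def polar_normal_pair(rng=random):
    """Return two independent normal deviates, following steps P1-P4."""
    while True:
        v1 = 2.0 * rng.random() - 1.0   # P1: uniform between -1 and +1
        v2 = 2.0 * rng.random() - 1.0
        s = v1 * v1 + v2 * v2           # P2
        if s < 1.0:                     # P3: retry unless inside the unit circle
            break
    if s == 0.0:                        # P4: avoid taking the logarithm of zero
        return 0.0, 0.0
    factor = math.sqrt(-2.0 * math.log(s) / s)
    return v1 * factor, v2 * factor


rng = random.Random(7)
xs = [x for _ in range(50_000) for x in polar_normal_pair(rng)]
mean = sum(xs) / len(xs)
variance = sum((x - mean) ** 2 for x in xs) / len(xs)
print(mean, variance)  # should be near 0 and 1
```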
To prove the validity of this method, we use elementary analytic geometry
and calculus: If S < 1 in step P3, the point in the plane with Cartesian
coordinates $(V_1, V_2)$ is a random point uniformly distributed inside the unit circle.
Transforming to polar coordinates $V_1 = R\cos\Theta$, $V_2 = R\sin\Theta$, we find

$$S = R^2, \qquad X_1 = \sqrt{-2\ln S}\,\cos\Theta, \qquad X_2 = \sqrt{-2\ln S}\,\sin\Theta.$$

Using also the polar coordinates $X_1 = R'\cos\Theta'$, $X_2 = R'\sin\Theta'$, we find that
$\Theta' = \Theta$ and $R' = \sqrt{-2\ln S}$. It is clear that $R'$ and $\Theta'$ are independent, since
$R$ and $\Theta$ are independent inside the unit circle. Also, $\Theta'$ is uniformly distributed
between 0 and $2\pi$; and the probability that $R' \le r$ is the probability that
$-2\ln S \le r^2$, namely the probability that $S \ge e^{-r^2/2}$. This equals $1 - e^{-r^2/2}$,
since $S = R^2$ is uniformly distributed between zero and one. The probability
that $R'$ lies between $r$ and $r + dr$ is therefore the differential of $1 - e^{-r^2/2}$,
namely $re^{-r^2/2}\,dr$. Similarly, the probability that $\Theta'$ lies between $\theta$ and $\theta + d\theta$
is $(1/2\pi)\,d\theta$. The joint probability that $X_1 \le x_1$ and that $X_2 \le x_2$ now can be
computed; it is

$$\begin{aligned}
\int_{\{(r,\theta)\,\mid\, r\cos\theta \le x_1,\ r\sin\theta \le x_2\}} \frac{1}{2\pi}\, e^{-r^2/2}\, r\,dr\,d\theta
&= \frac{1}{2\pi} \int_{\{(x,y)\,\mid\, x \le x_1,\ y \le x_2\}} e^{-(x^2+y^2)/2}\,dx\,dy\\
&= \left(\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x_1} e^{-x^2/2}\,dx\right)
   \left(\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x_2} e^{-y^2/2}\,dy\right).
\end{aligned}$$

This calculation proves that $X_1$ and $X_2$ are independent and normally distributed,
as desired.


2) The rectangle-wedge-tail method, introduced by G. Marsaglia. Here we use 
the function 


$$F(x) = \operatorname{erf}(x/\sqrt2) = \sqrt{\frac{2}{\pi}} \int_0^x e^{-t^2/2}\,dt, \qquad x \ge 0, \tag{12}$$


which gives the distribution of the absolute value of a normal deviate. After X 
has been computed according to distribution (12), we will attach a random sign 
to its value, and this will make it a true normal deviate. 

The rectangle-wedge-tail approach is based on several important general 
techniques that we shall explore as we develop the algorithm. The first key idea 
is to regard F(x) as a mixture of several other functions, namely to write 


$$F(x) = p_1F_1(x) + p_2F_2(x) + \cdots + p_nF_n(x), \tag{13}$$

where $F_1$, $F_2$, ..., $F_n$ are appropriate distributions and $p_1$, $p_2$, ..., $p_n$ are
nonnegative probabilities that sum to 1. If we generate a random variable X by
choosing distribution $F_j$ with probability $p_j$, it is easy to see that X will have
distribution F overall. Some of the distributions $F_j(x)$ may be rather difficult to
handle, even harder than F itself, but we can usually arrange things so that the
probability $p_j$ is very small in that case. Most of the distributions $F_j(x)$ will be
quite easy to accommodate, since they will be trivial modifications of the uniform
distribution. The resulting method yields an extremely efficient program, since
its average running time is very small.

Fig. 9. The density function divided into 31 parts. The area of each part represents
the average number of times a random number with that density is to be computed.

It is easier to understand the method if we work with the derivatives of the 
distributions instead of the distributions themselves. Let 


$$f(x) = F'(x), \qquad f_j(x) = F_j'(x)$$

be the density functions of the probability distributions. Equation (13) becomes

$$f(x) = p_1f_1(x) + p_2f_2(x) + \cdots + p_nf_n(x). \tag{14}$$

Each $f_j(x)$ is $\ge 0$, and the total area under the graph of $f_j(x)$ is 1; so there is
a convenient graphical way to display the relation (14): The area under f(x)
is divided into n parts, with the part corresponding to $f_j(x)$ having area $p_j$.
See Fig. 9, which illustrates the situation in the case of interest to us here, with
$f(x) = F'(x) = \sqrt{2/\pi}\,e^{-x^2/2}$; the area under this curve has been divided into n =
31 parts. There are 15 rectangles, which represent $p_1f_1(x)$, ..., $p_{15}f_{15}(x)$; there
are 15 wedge-shaped pieces, which represent $p_{16}f_{16}(x)$, ..., $p_{30}f_{30}(x)$; and the
remaining part $p_{31}f_{31}(x)$ is the “tail,” namely the entire graph of f(x) for $x \ge 3$.

The rectangular parts $f_1(x)$, ..., $f_{15}(x)$ represent uniform distributions.
For example, $f_3(x)$ represents a random variable uniformly distributed between
$\frac25$ and $\frac35$. The altitude of $p_jf_j(x)$ is $f(j/5)$, hence the area of the jth rectangle is

$$p_j = \frac15 f(j/5) = \sqrt{\frac{2}{25\pi}}\; e^{-j^2/50}, \qquad\text{for } 1 \le j \le 15. \tag{15}$$

In order to generate such rectangular portions of the distribution, we simply
compute

$$X = \tfrac15 U + S, \tag{16}$$


Fig. 10. Density functions for which Algorithm L may be used to generate random 
numbers. 


where U is uniform and S takes the value $(j-1)/5$ with probability $p_j$. Since
$p_1 + \cdots + p_{15} = .9183$, we can use simple uniform deviates like this about 92
percent of the time.
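
The split of the total probability among rectangles, wedges, and tail follows directly from (12) and (15). A minimal numerical check, with variable names chosen here for illustration:

```python
import math


def density(x):
    """f(x) = sqrt(2/pi) * exp(-x^2/2), the density of |X| for normal X."""
    return math.sqrt(2.0 / math.pi) * math.exp(-x * x / 2.0)


# Equation (15): the jth rectangle has width 1/5 and altitude f(j/5).
rectangles = [density(j / 5.0) / 5.0 for j in range(1, 16)]
p_rectangles = sum(rectangles)

# The tail is the probability that |X| >= 3, from distribution (12).
p_tail = 1.0 - math.erf(3.0 / math.sqrt(2.0))

# Whatever remains belongs to the 15 wedges.
p_wedges = 1.0 - p_rectangles - p_tail

print(round(p_rectangles, 4), round(p_wedges, 4), round(p_tail, 4))
```

The wedges account for most of the remaining 8 percent; the tail is needed far less often.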

In the remaining 8 percent, we will usually have to generate one of the
wedge-shaped distributions $F_{16}$, ..., $F_{30}$. Typical examples of what we need to
do are shown in Fig. 10. When x < 1, the curved part is concave, and when
x > 1 it is convex, but in each case the curved part is reasonably close to a
straight line, and it can be enclosed in two parallel lines as shown.

To handle these wedge-shaped distributions, we will rely on yet another 
general technique, von Neumann’s rejection method for obtaining a complicated 
density from another one that “encloses” it. The polar method described above is 
a simple example of such an approach: Steps P1—P3 obtain a random point inside 
the unit circle by first generating a random point in a larger square, rejecting it 
and starting over again if the point was outside the circle. 

The general rejection method is even more powerful than this. To generate a
random variable X with density f, let g be another probability density function
such that

$$f(t) \le cg(t) \tag{17}$$

for all t, where c is a constant. Now generate X according to density g, and also
generate an independent uniform deviate U. If $U \ge f(X)/cg(X)$, reject X and
start again with another X and U. When the condition $U < f(X)/cg(X)$ finally
occurs, the resulting X will have density f as desired. [Proof: $X \le x$ will occur
with probability $p(x) = \int_{-\infty}^{x} \bigl(g(t)\,dt \cdot f(t)/cg(t)\bigr) + qp(x)$, where the quantity
$q = \int_{-\infty}^{\infty} \bigl(g(t)\,dt \cdot (1 - f(t)/cg(t))\bigr) = 1 - 1/c$ is the probability of rejection; hence
$p(x) = \int_{-\infty}^{x} f(t)\,dt$.]
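
A generic form of the rejection method can be sketched as follows. The target density $f(x) = 6x(1-x)$ on [0, 1), the uniform enclosing density g, and the constant c = 1.5 are assumptions chosen only to give a runnable example; the routine also returns the number of trials, to show that about c of them are needed on the average.

```python
import random


def rejection_sample(f, sample_g, g, c, rng=random):
    """Return (X, trials), where X has density f and f(t) <= c*g(t) for all t."""
    trials = 0
    while True:
        trials += 1
        x = sample_g(rng)                 # X generated according to density g
        u = rng.random()                  # independent uniform deviate
        if u < f(x) / (c * g(x)):         # accept; otherwise reject and retry
            return x, trials


def f(x):
    return 6.0 * x * (1.0 - x)            # maximum value 1.5, attained at x = 1/2


def g(x):
    return 1.0                            # uniform density on [0, 1)


rng = random.Random(99)
results = [rejection_sample(f, lambda r: r.random(), g, 1.5, rng)
           for _ in range(50_000)]
mean_x = sum(x for x, _ in results) / len(results)
mean_trials = sum(t for _, t in results) / len(results)
print(mean_x, mean_trials)  # near 0.5 and 1.5
```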

The rejection technique is most efficient when c is small, since there will be 
c iterations on the average before a value is accepted. (See exercise 6.) In some 
cases f(x)/cg(x) is always 0 or 1; then U need not be generated. In other cases 
if f(x)/cg(x) is hard to compute, we may be able to “squeeze” it between two 
bounding functions 


$$r(x) \le f(x)/cg(x) \le s(x) \tag{18}$$

that are much simpler, and the exact value of f(x)/cg(x) need not be calculated
unless $r(x) \le U < s(x)$. The following algorithm solves the wedge problem by
developing the rejection method still further.

Fig. 11. Region of “acceptance” in Algorithm L.


Algorithm L (Nearly linear densities). This algorithm may be used to generate
a random variable X for any distribution whose density f(x) satisfies the
following conditions (see Fig. 10):

$$\begin{aligned}
&f(x) = 0, &&\text{for } x < s \text{ and for } x > s + h;\\
&a - b(x - s)/h \le f(x) \le b - b(x - s)/h, &&\text{for } s \le x \le s + h.
\end{aligned} \tag{19}$$

L1. [Get U ≤ V.] Generate two independent random variables U and V, uniformly
    distributed between zero and one. If U > V, exchange $U \leftrightarrow V$.

L2. [Easy case?] If $V \le a/b$, go to L4.

L3. [Try again?] If $V > U + (1/b)f(s + hU)$, go back to step L1. (If a/b is close
    to 1, this step of the algorithm will not be necessary very often.)

L4. [Compute X.] Set $X \leftarrow s + hU$. ▮
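
Algorithm L can be sketched directly from steps L1 through L4. The example density $f(x) = \frac32(1 - x^2)$ on [0, 1], with s = 0, h = 1, a = 1.5, b = 3, is an assumption chosen here because it satisfies conditions (19) and has the easily checked mean 3/8; it is not one of the wedges of Fig. 9.

```python
import random


def algorithm_l(f, s, h, a, b, rng=random):
    """Generate X with a nearly linear density f satisfying conditions (19)."""
    while True:
        u, v = rng.random(), rng.random()   # L1
        if u > v:
            u, v = v, u                     # now u <= v
        if v <= a / b:                      # L2: easy case, f is not evaluated
            break
        if v <= u + f(s + h * u) / b:       # L3: accept; otherwise try again
            break
    return s + h * u                        # L4


def f(x):
    return 1.5 * (1.0 - x * x)              # concave "wedge" on [0, 1]


rng = random.Random(2718)
sample = [algorithm_l(f, 0.0, 1.0, 1.5, 3.0, rng) for _ in range(100_000)]
print(sum(sample) / len(sample))  # near 3/8
```

With these constants a/b = 1/2, so the density itself is evaluated in only a fraction of the trials.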


When step L4 is reached, the point (U, V) is a random point in the area
shaded in Fig. 11, namely, $0 \le U \le V \le U + (1/b)f(s + hU)$. Conditions (19)
ensure that

$$\frac{a}{b} \le U + \frac{1}{b} f(s + hU) \le 1.$$

Now the probability that $X \le s + hx$, for $0 \le x \le 1$, is the area that lies to the
left of the vertical line U = x in Fig. 11, divided by the total area, namely

$$\int_0^x \frac1b f(s + hu)\,du \Bigm/ \int_0^1 \frac1b f(s + hu)\,du = \int_s^{s+hx} f(v)\,dv;$$

therefore X has the correct distribution.

With appropriate constants $a_j$, $b_j$, $s_j$, Algorithm L will take care of the
wedge-shaped densities $f_{j+15}$ of Fig. 9, for $1 \le j \le 15$. The final distribution,
$F_{31}$, needs to be treated only about one time in 370; it is used whenever a result
$X \ge 3$ is to be computed. Exercise 11 shows that a standard rejection scheme
can be used for this “tail.” We are ready to consider the procedure in its entirety:




Fig. 12. The “rectangle-wedge-tail” algorithm for generating normal deviates. 


Algorithm M (Rectangle-wedge-tail method for normal deviates). For this
algorithm we use auxiliary tables $(P_0,\ldots,P_{31})$, $(Q_1,\ldots,Q_{15})$, $(Y_0,\ldots,Y_{31})$,
$(Z_0,\ldots,Z_{31})$, $(S_1,\ldots,S_{16})$, $(D_{16},\ldots,D_{30})$, $(E_{16},\ldots,E_{30})$, constructed as
explained in exercise 10; examples appear in Table 1. We assume that a binary
computer is being used; a similar procedure could be worked out for decimal
machines.

M1. [Get U.] Generate a uniform random number $U = (.b_0b_1b_2\ldots b_t)_2$. (Here
    the b’s are the bits in the binary representation of U. For reasonable
    accuracy, t should be at least 24.) Set $\psi \leftarrow b_0$. (Later, $\psi$ will be used
    to determine the sign of the result.)

M2. [Rectangle?] Set $j \leftarrow (b_1b_2b_3b_4b_5)_2$, a binary number determined by the
    leading bits of U, and set $f \leftarrow (.b_6b_7\ldots b_t)_2$, the fraction determined by
    the remaining bits. If $f \ge P_j$, set $X \leftarrow Y_j + fZ_j$ and go to M9. Otherwise
    if $j \le 15$ (that is, $b_1 = 0$), set $X \leftarrow S_j + fQ_j$ and go to M9. (This is an
    adaptation of Walker’s alias method (3).)

M3. [Wedge or tail?] (Now $16 \le j \le 31$, and each particular value j occurs with
    probability $p_j$.) If j = 31, go to M7.

M4. [Get U ≤ V.] Generate two new uniform deviates, U and V; if U > V,
    exchange $U \leftrightarrow V$. (We are now performing a special case of Algorithm L.)
    Set $X \leftarrow S_{j-15} + \frac15 U$.

M5. [Easy case?] If $V \le D_j$, go to M9.




Table 1
EXAMPLE OF TABLES USED WITH ALGORITHM M*

     j    P_j   P_{j+16}   Q_j     Y_j    Y_{j+16}   Z_j   Z_{j+16}  S_{j+1}  D_{j+15}  E_{j+15}
     0    .000    .067      *      0.00     0.59     0.20    0.21     0.0       *         *
     1    .849    .161    .236    -0.92     0.96     1.32    0.24     0.2     .505     25.00
     2    .970    .236    .206    -5.86    -0.06     6.66    0.26     0.4     .773     12.50
     3    .855    .285    .234    -0.58     0.12     1.38    0.28     0.6     .876      8.33
     4    .994    .308    .201   -33.16     1.31    34.96    0.29     0.8     .939      6.25
     5    .995    .304    .201   -39.51     0.31    41.31    0.29     1.0     .986      5.00
     6    .933    .280    .214    -2.57     1.12     2.97    0.28     1.2     .995      4.06
     7    .923    .241    .217    -1.61     0.54     2.61    0.26     1.4     .987      3.37
     8    .727    .197    .275     0.67     0.75     0.73    0.25     1.6     .979      2.86
     9   1.000    .152    .200      *       0.56      *      0.24     1.8     .972      2.47
    10    .691    .112    .289     0.35     0.17     0.65    0.23     2.0     .966      2.16
    11    .454    .079    .440    -0.17     0.38     0.37    0.22     2.2     .960      1.92
    12    .287    .052    .698     0.92    -0.01     0.28    0.21     2.4     .954      1.71
    13    .174    .033   1.150     0.36     0.39     0.24    0.21     2.6     .948      1.54
    14    .101    .020   1.974    -0.02     0.20     0.22    0.20     2.8     .942      1.40
    15    .057    .086   3.526     0.19     0.78     0.21    0.22     3.0     .936      1.27

*In practice, this data would be given with much greater precision; the table shows only enough
figures so that interested readers will be able to test their own algorithms for computing the
values more accurately. The values of $Q_0$, $Y_9$, $Z_9$, $D_{15}$, and $E_{15}$ are not used; they are shown as * above.


M6. [Another try?] If $V > U + E_j\bigl(e^{(S_{j-14}^2 - X^2)/2} - 1\bigr)$, go back to step M4;
    otherwise go to M9. (This step is executed with low probability.)

M7. [Get supertail deviate.] Generate two new independent uniform deviates,
    U and V, and set $X \leftarrow \sqrt{9 - 2\ln V}$.

M8. [Reject?] If $UX \ge 3$, go back to step M7. (This will occur only about
    one-twelfth as often as we reach step M8.)

M9. [Attach sign.] If $\psi = 1$ (the sign bit $b_0$ saved in step M1), set $X \leftarrow -X$. ▮
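
A full implementation of Algorithm M needs the tables at full precision, but the tail steps M7 and M8 stand on their own. The sketch below, with a hypothetical function name, generates $X \ge 3$ with density proportional to $e^{-x^2/2}$; it uses 1 − U in place of V so that the logarithm is never taken at zero, which is an implementation detail not in the text.

```python
import math
import random


def normal_tail(rng=random):
    """Steps M7-M8: return X >= 3 distributed as the tail of |normal deviate|."""
    while True:
        u = rng.random()
        v = 1.0 - rng.random()                    # in (0, 1], so log(v) is defined
        x = math.sqrt(9.0 - 2.0 * math.log(v))    # M7
        if u * x < 3.0:                           # M8: accept with probability 3/x
            return x


rng = random.Random(31)
tail = [normal_tail(rng) for _ in range(20_000)]
print(min(tail), sum(tail) / len(tail))  # minimum is at least 3; mean is about 3.28
```

Step M7 alone gives density proportional to $xe^{-x^2/2}$ for $x \ge 3$; accepting with probability 3/x removes the extra factor x.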


This algorithm is a very pretty example of mathematical theory intimately 
interwoven with programming ingenuity — a fine illustration of the art of com- 
puter programming! Only steps M1, M2, and M9 need to be performed most 
of the time, and the other steps aren’t terribly slow either. The first publica- 
tions of the rectangle-wedge-tail method were by G. Marsaglia, Annals Math. 
Stat. 32 (1961), 894-899; G. Marsaglia, M. D. MacLaren, and T. A. Bray, 
CACM 7 (1964), 4-10. Further refinements of Algorithm M have been developed 
by G. Marsaglia, K. Ananthanarayanan, and N. J. Paul, Inf. Proc. Letters 5 
(1976), 27-30. 


3) The odd-even method, due to G. E. Forsythe. An amazingly simple technique 
for generating random deviates with a density of the general exponential form 


$$f(x) = Ce^{-h(x)}\,[a \le x < b], \tag{20}$$

when

$$0 \le h(x) \le 1 \qquad\text{for } a \le x < b, \tag{21}$$


was discovered by John von Neumann and G. E. Forsythe about 1950. The idea 
is based on the rejection method described earlier, letting g(x) be the uniform 
distribution on [a..b): We set X + a + (b — a)U, where U is a uniform deviate, 
and then we want to accept X with probability $e^{-h(X)}$. The latter operation
could be done by comparing $e^{-h(X)}$ to V, or $h(X)$ to $-\ln V$, when V is another
uniform deviate, but the job can be done without applying any transcendental
functions in the following interesting way. Set $V_0 \leftarrow h(X)$, then generate uniform
deviates $V_1$, $V_2$, ... until finding some $K \ge 1$ with $V_{K-1} < V_K$. For fixed X and k,
the probability that $h(X) \ge V_1 \ge \cdots \ge V_k$ is $1/k!$ times the probability that
$\max(V_1,\ldots,V_k) \le h(X)$, namely $h(X)^k/k!$; hence the probability that K = k is
$h(X)^{k-1}/(k-1)! - h(X)^k/k!$, and the probability that K is odd is

$$\sum_{k\ \mathrm{odd},\ k\ge1} \left(\frac{h(X)^{k-1}}{(k-1)!} - \frac{h(X)^k}{k!}\right) = e^{-h(X)}. \tag{22}$$

Therefore we reject X and try again if K is even; we accept X as a random
variable with density (20) if K is odd. We usually won’t have to generate
many V’s in order to determine K, since the average value of K (given X)
is $\sum_{k\ge0} \Pr(K > k) = \sum_{k\ge0} h(X)^k/k! = e^{h(X)} \le e$.
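
The odd-even test can be sketched as a routine that returns true with probability $e^{-h}$ while using only comparisons of uniform deviates. The function name, seed, and the test value h = 0.5 are assumptions made for illustration.

```python
import math
import random


def accept_with_probability_exp_minus(h, rng=random):
    """Return True with probability exp(-h), for 0 <= h <= 1, using no exp or log."""
    previous = h            # V_0 = h
    k = 1
    while True:
        v = rng.random()    # V_k
        if previous < v:    # first index K with V_{K-1} < V_K has been found
            return k % 2 == 1   # accept exactly when K is odd
        previous = v
        k += 1


rng = random.Random(5)
n = 100_000
accepted = sum(accept_with_probability_exp_minus(0.5, rng) for _ in range(n))
print(accepted / n, math.exp(-0.5))  # the two values should be close
```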

Forsythe realized some years later that this approach leads to an efficient 
method for calculating normal deviates, without the need for any auxiliary 
routines to calculate square roots or logarithms as in Algorithms P and M. His 
procedure, with an improved choice of intervals [a..b) due to J. H. Ahrens and 
U. Dieter, can be summarized as follows. 


Algorithm F (Odd-even method for normal deviates). This algorithm generates
normal deviates on a binary computer, assuming approximately $t + 1$ bits of
accuracy. It requires a table of values $d_j = a_j - a_{j-1}$, for $1 \le j \le t + 1$, where
$a_j$ is defined by the relation

$$\sqrt{\frac{2}{\pi}} \int_{a_j}^{\infty} e^{-x^2/2}\,dx = \frac{1}{2^j}. \tag{23}$$

F1. [Get U.] Generate a uniform random number $U = (.b_0 b_1 \ldots b_t)_2$, where $b_0$,
$b_1$, ..., $b_t$ denote the bits in binary notation. Set $\psi \leftarrow b_0$, $j \leftarrow 1$, and $a \leftarrow 0$.

F2. [Find first zero $b_j$.] If $b_j = 1$, set $a \leftarrow a + d_j$, $j \leftarrow j + 1$, and repeat this
step. (If $j = t + 1$, treat $b_j$ as zero.)

F3. [Generate candidate.] (Now $a = a_{j-1}$, and the current value of $j$ occurs with
probability $\approx 2^{-j}$. We will generate $X$ in the range $[a_{j-1} \,..\, a_j)$, using the
rejection method above, with $h(x) = x^2/2 - a^2/2 = y^2/2 + ay$ where $y = x - a$.
Exercise 12 proves that $h(x) \le 1$ as required in (21).) Set $Y \leftarrow d_j$ times
$(.b_{j+1} \ldots b_t)_2$ and $V \leftarrow (\frac12 Y + a)Y$. (Since the average value of $j$ is 2, there
will usually be enough significant bits in $(.b_{j+1} \ldots b_t)_2$ to provide decent
accuracy. The calculations are readily done in fixed point arithmetic.)

F4. [Reject?] Generate a uniform deviate $U$. If $V < U$, go on to step F5.
Otherwise set $V$ to a new uniform deviate; and repeat step F4 if the new $V$
is $< U$. Otherwise (that is, if $K$ is even, in the discussion above), replace $U$
by a new uniform deviate $(.b_0 b_1 \ldots b_t)_2$ and go back to F3.

F5. [Return X.] Set $X \leftarrow a + Y$. If $\psi = 1$, set $X \leftarrow -X$. ∎
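
A floating point sketch of Algorithm F follows. It is not the fixed point implementation the text has in mind: the choice $t = 31$, the helper names, and the use of `statistics.NormalDist().inv_cdf` to build the table of $a_j$ (instead of the published table of $d_j$) are all assumptions made for illustration.

```python
import random
from statistics import NormalDist

T = 31                                   # U has t + 1 = 32 bits b_0 ... b_t
_nd = NormalDist()
# (23) says 2 * (1 - Phi(a_j)) = 2**-j, hence a_j = Phi^{-1}(1 - 2**-(j + 1)).
A = [0.0] + [_nd.inv_cdf(1.0 - 2.0 ** -(j + 1)) for j in range(1, T + 2)]
D = [0.0] + [A[j] - A[j - 1] for j in range(1, T + 2)]   # d_j, 1 <= j <= t + 1

def _bit(u, i):
    """Bit b_i of the (t + 1)-bit fraction u = (.b_0 b_1 ... b_t)_2."""
    return (u >> (T - i)) & 1

def _tail_fraction(u, j):
    """Value of (.b_{j+1} ... b_t)_2 as a float in [0, 1)."""
    nbits = T - j
    if nbits <= 0:
        return 0.0
    return (u & ((1 << nbits) - 1)) / (1 << nbits)

def normal_deviate_f(rng=random):
    u = rng.getrandbits(T + 1)                       # F1
    psi, j, a = _bit(u, 0), 1, 0.0
    while j <= T and _bit(u, j) == 1:                # F2 (b_{t+1} counts as zero)
        a += D[j]
        j += 1
    while True:
        y = D[j] * _tail_fraction(u, j)              # F3
        v = (0.5 * y + a) * y
        while True:                                  # F4: odd-even test
            w = rng.random()
            if v < w:                                # K is odd: accept
                return -(a + y) if psi else a + y    # F5
            v = rng.random()
            if not (v < w):                          # K is even: reject
                break
        u = rng.getrandbits(T + 1)                   # fresh bits, back to F3
```

A quick sanity check is to draw a few thousand values and confirm that the sample mean is near 0 and the sample variance is near 1.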


[Fig. 13. The acceptance region (24) in the ratio-of-uniforms method.]
Values of $d_j$ for $1 \le j \le 47$ appear in a paper by Ahrens and Dieter, Math.
Comp. 27 (1973), 927-937; their paper discusses refinements of the algorithm 
that improve its speed at the expense of more tables. Algorithm F is attractive 
since it is almost as fast as Algorithm M and it is easier to implement. The 
average number of uniform deviates per normal deviate is 2.53947; R. P. Brent 
[CACM 17 (1974), 704-705] has shown how to reduce this number to 1.37446 
at the expense of two subtractions and one division per uniform deviate saved. 


4) Ratios of uniform deviates. There is yet another good way to generate 
normal deviates, discovered by A. J. Kinderman and J. F. Monahan in 1976. 
Their idea is to generate a random point (U,V) in the region defined by 


$$0 < u \le 1, \qquad -2u\sqrt{\ln(1/u)} \le v \le 2u\sqrt{\ln(1/u)}, \tag{24}$$


and then to output the ratio $X \leftarrow V/U$. The shaded area of Fig. 13 is the magic
region (24) that makes this all work. Before we study the associated theory, let
us first state the algorithm so that its efficiency and simplicity are manifest:

Algorithm R (Ratio method for normal deviates). This algorithm generates
normal deviates $X$.

R1. [Get U, V.] Generate two independent uniform deviates $U$ and $V$, where
$U$ is nonzero, and set $X \leftarrow \sqrt{8/e}\,(V - \frac12)/U$. (Now $X$ is the ratio of
the coordinates $(U, \sqrt{8/e}\,(V - \frac12))$ of a random point in the rectangle that
encloses the shaded region in Fig. 13. We will accept $X$ if the corresponding
point actually lies “in the shade,” otherwise we will try again.)

R2. [Optional upper bound test.] If $X^2 \le 5 - 4e^{1/4}U$, output $X$ and terminate
the algorithm. (This step can be omitted if desired; it tests whether or not
the selected point is in the interior region of Fig. 13, making it unnecessary
to calculate a logarithm.)

R3. [Optional lower bound test.] If $X^2 \ge 4e^{-1.35}/U + 1.4$, go back to R1. (This
step could also be omitted; it tests whether or not the selected point is
outside the exterior region of Fig. 13, making it unnecessary to calculate a
logarithm.)

R4. [Final test.] If $X^2 \le -4\ln U$, output $X$ and terminate the algorithm.
Otherwise go back to R1. ∎
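
Steps R1-R4 translate almost line for line into Python. The sketch below assumes floating point uniforms from the `random` module; the function and constant names are invented for the example.

```python
import math
import random

SQRT_8_OVER_E = math.sqrt(8.0 / math.e)
FOUR_E_QUARTER = 4.0 * math.exp(0.25)      # 4 e^{1/4}, used in R2
FOUR_E_M135 = 4.0 * math.exp(-1.35)        # 4 e^{-1.35}, used in R3

def normal_deviate_r(rng=random):
    while True:
        u = rng.random()
        if u == 0.0:                        # R1 requires U nonzero
            continue
        v = rng.random()
        x = SQRT_8_OVER_E * (v - 0.5) / u   # R1
        xx = x * x
        if xx <= 5.0 - FOUR_E_QUARTER * u:  # R2: inside the interior region
            return x
        if xx >= FOUR_E_M135 / u + 1.4:     # R3: outside the exterior region
            continue
        if xx <= -4.0 * math.log(u):        # R4: exact test
            return x
```

Deleting the R2 or R3 lines changes only the running time, not the distribution, since both tests merely anticipate the outcome of R4.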


Exercises 20 and 21 work out the timing analysis; four different algorithms 
are analyzed, since steps R2 and R3 can be included or omitted depending on 
one’s preference. The following table shows how many times each step will be 
performed, on the average, depending on which of the optional tests is applied: 


Step    Neither    R2 only    R3 only    Both

R1      1.369      1.369      1.369      1.369
R2      0          1.369      0          1.369      (25)
R3      0          0          1.369      0.467
R4      1.369      0.467      1.134      0.232


Thus it pays to omit the optional tests if there is a very fast logarithm operation, 
but if the log routine is rather slow it pays to include them. 

But why does it work? One reason is that we can calculate the probability 
that X < x, and it turns out to be the correct value (10). But such a calculation 
isn’t very easy unless one happens to hit on the right trick, and anyway it is 
better to understand how the algorithm might have been discovered in the first 
place. Kinderman and Monahan derived it by working out the following theory 
that can be used with any well-behaved density function f(x) [see ACM Trans. 
Math. Software 3 (1977), 257-260]. 

In general, suppose that a point (U,V) has been generated uniformly over 
the region of the (u, v)-plane defined by 


$$u > 0, \qquad u^2 \le g(v/u) \tag{26}$$


for some nonnegative integrable function $g$. If we set $X \leftarrow V/U$, the probability
that $X \le x$ can be calculated by integrating $du\,dv$ over the region defined by the
two relations in (26) plus the auxiliary condition $v/u \le x$, then dividing by the
same integral without this extra condition. Letting $v = tu$, so that $dv = u\,dt$,
the integral becomes

$$\int_{-\infty}^{x} dt \int_{0}^{\sqrt{g(t)}} u\,du = \frac12 \int_{-\infty}^{x} g(t)\,dt.$$

Hence the probability that $X \le x$ is

$$\int_{-\infty}^{x} g(t)\,dt \bigg/ \int_{-\infty}^{+\infty} g(t)\,dt. \tag{27}$$




The normal distribution comes out when $g(t) = e^{-t^2/2}$; and the condition
$u^2 \le g(v/u)$ simplifies in this case to $(v/u)^2 \le -4\ln u$. It is easy to see that the
set of all such pairs $(u, v)$ is entirely contained in the rectangle of Fig. 13.
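
The general principle (26)-(27) can be tried with any $g$ for which a bounding rectangle is known. The following sketch assumes the caller supplies that rectangle; `ratio_of_uniforms` and its parameters `u_max`, `v_min`, `v_max` are hypothetical names, and the normal case uses the bounds $0 < u \le 1$, $|v| \le \sqrt{2/e}$ that correspond to the rectangle of step R1.

```python
import math
import random

def ratio_of_uniforms(g, u_max, v_min, v_max, rng=random):
    """Return V/U for (U, V) uniform over {u > 0, u*u <= g(v/u)}.
    The rectangle (0, u_max] x [v_min, v_max] must enclose that region."""
    while True:
        u = rng.random() * u_max
        v = v_min + rng.random() * (v_max - v_min)
        if u > 0.0 and u * u <= g(v / u):
            return v / u

def normal_by_ratio(rng=random):
    b = math.sqrt(2.0 / math.e)      # largest |v| on the boundary of the region
    return ratio_of_uniforms(lambda t: math.exp(-0.5 * t * t), 1.0, -b, b, rng)
```

With $g(t) = 1/(1 + t^2)$ and the rectangle $(0, 1] \times [-1, 1]$ the same routine would produce Cauchy deviates, since the region becomes a half disk.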

The bounds in steps R2 and R3 define interior and exterior regions with 
simpler boundary equations. The well-known inequality 


$$e^x \ge 1 + x,$$

which holds for all real numbers $x$, can be used to show that

$$1 + \ln c - cu \;\le\; -\ln u \;\le\; 1/(cu) - 1 + \ln c \tag{28}$$

for any constant $c > 0$. Exercise 21 proves that $c = e^{1/4}$ is the best possible
constant to use in step R2. The situation is more complicated in step R3, and
there doesn’t seem to be a simple expression for the optimum $c$ in that case,
but computational experiments show that the best value for R3 is $\approx e^{1.35}$. The
approximating curves (28) are tangent to the true boundary when $u = 1/c$.
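
For readers who want the intermediate steps, here is one way to obtain (28) and the constants of steps R2 and R3 from $e^x \ge 1 + x$; the substitutions are a worked reconstruction, not quoted from the text.

```latex
% Lower bound: put x = \ln(cu) in e^x \ge 1 + x.
cu \ge 1 + \ln c + \ln u
  \quad\Longrightarrow\quad -\ln u \ge 1 + \ln c - cu .
% Upper bound: put x = -\ln(cu).
\frac{1}{cu} \ge 1 - \ln c - \ln u
  \quad\Longrightarrow\quad -\ln u \le \frac{1}{cu} - 1 + \ln c .
% Multiply by 4 and compare with the exact test X^2 \le -4\ln U of step R4.
% With c = e^{1/4}:   4(1 + \ln c) - 4cU = 5 - 4e^{1/4}U        (step R2).
% With c = e^{1.35}:  4/(cU) - 4 + 4\ln c = 4e^{-1.35}/U + 1.4  (step R3).
```

Equality holds in both bounds exactly when $cu = 1$, which is the tangency point mentioned above.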

With an improved approximation to the acceptance region [see J. L. Leva, 
ACM Trans. Math. Software 18 (1992), 449-455] we can, in fact, reduce the 
expected number of logarithm computations to only 0.012. 

It is possible to obtain a faster method by partitioning the region into 
subregions, most of which can be handled more quickly. Of course, this means 
that auxiliary tables will be needed, as in Algorithms M and F. An interesting 
alternative that requires fewer auxiliary table entries has been suggested by 
Ahrens and Dieter in CACM 31 (1988), 1330-1337. 


5) Normal deviates from normal deviates. Exercise 31 discusses an interesting 
approach that saves time by working directly with normal deviates instead of 
basing everything on uniform deviates. This method, introduced by C. S. Wallace 
in 1996, has comparatively little theoretical support at the present time, but it 
has successfully passed a number of empirical tests. 


6) Variations of the normal distribution. So far we have considered the normal 
distribution with mean zero and standard deviation one. If $X$ has this distribution, then

$$Y = \mu + \sigma X \tag{29}$$

has the normal distribution with mean $\mu$ and standard deviation $\sigma$. Furthermore,
if $X_1$ and $X_2$ are independent normal deviates with mean zero and standard
deviation one, and if

$$Y_1 = \mu_1 + \sigma_1 X_1, \qquad Y_2 = \mu_2 + \sigma_2\bigl(\rho X_1 + \sqrt{1 - \rho^2}\,X_2\bigr), \tag{30}$$

then $Y_1$ and $Y_2$ are dependent random variables, normally distributed with means
$\mu_1$, $\mu_2$ and standard deviations $\sigma_1$, $\sigma_2$, and with correlation coefficient $\rho$. (For a
generalization to $n$ variables, see exercise 13.)
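
Formula (30) is easy to exercise numerically. The sketch below uses `random.gauss` as a stand-in for any source of standard normal deviates; the function name is an assumption.

```python
import math
import random

def correlated_normal_pair(mu1, sigma1, mu2, sigma2, rho, rng=random):
    """Return (Y1, Y2) as in (30), built from two independent N(0, 1) deviates."""
    x1 = rng.gauss(0.0, 1.0)
    x2 = rng.gauss(0.0, 1.0)
    y1 = mu1 + sigma1 * x1
    y2 = mu2 + sigma2 * (rho * x1 + math.sqrt(1.0 - rho * rho) * x2)
    return y1, y2
```

The sample correlation of many such pairs should approach `rho`, whatever the means and standard deviations are.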


D. The exponential distribution. After uniform deviates and normal de- 
viates, the next most important random quantity is an exponential deviate. 
Such numbers occur in “arrival time” situations; for example, if a radioactive 
substance emits alpha particles at a rate such that one particle is emitted every
$\mu$ seconds on the average, then the time between two successive emissions has
the exponential distribution with mean $\mu$. This distribution is defined by the
formula

$$F(x) = 1 - e^{-x/\mu}, \qquad x \ge 0. \tag{31}$$

1) Logarithm method. Clearly, if $y = F(x) = 1 - e^{-x/\mu}$, then $x = F^{[-1]}(y) =
-\mu\ln(1 - y)$. Therefore $-\mu\ln(1 - U)$ has the exponential distribution by Eq. (7).
Since $1 - U$ is uniformly distributed when $U$ is, we conclude that

$$X = -\mu \ln U \tag{32}$$

is exponentially distributed with mean $\mu$. (The case $U = 0$ must be treated
specially; we can substitute any convenient value $\epsilon$ for 0, since the probability of
this case is extremely small.)
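
In Python the special case $U = 0$ can be sidestepped rather than patched: `random.random()` returns values in $[0, 1)$, so $1 - U$ lies in $(0, 1]$ and its logarithm is always defined. This is one convenient choice under that assumption, not the only way to treat the zero case.

```python
import math
import random

def exponential_log(mu, rng=random):
    """Exponential deviate with mean mu by the logarithm method (32)."""
    u = 1.0 - rng.random()       # uniform on (0, 1], never zero
    return -mu * math.log(u)
```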


2) Random minimization method. We saw in Algorithm F that there are 
simple and fast alternatives to calculating the logarithm of a uniform deviate. 
The following especially efficient approach has been developed by G. Marsaglia, 
M. Sibuya, and J. H. Ahrens [see CACM 15 (1972), 876-877]: 


Algorithm S (Exponential distribution with mean $\mu$). This algorithm produces
exponential deviates on a binary computer, using uniform deviates with $(t + 1)$-bit
accuracy. The constants

$$Q[k] = \frac{\ln 2}{1!} + \frac{(\ln 2)^2}{2!} + \cdots + \frac{(\ln 2)^k}{k!}, \qquad k \ge 1, \tag{33}$$

should be precomputed, extending until $Q[k] > 1 - 2^{-t}$.

S1. [Get U and shift.] Generate a $(t + 1)$-bit uniform random binary fraction
$U = (.b_0 b_1 b_2 \ldots b_t)_2$; locate the first zero bit $b_j$, and shift off the leading $j + 1$
bits, setting $U \leftarrow (.b_{j+1} \ldots b_t)_2$. (As in Algorithm F, the average number of
discarded bits is 2.)

S2. [Immediate acceptance?] If $U < \ln 2$, set $X \leftarrow \mu(j \ln 2 + U)$ and terminate
the algorithm. (Note that $Q[1] = \ln 2$.)

S3. [Minimize.] Find the least $k \ge 2$ such that $U < Q[k]$. Generate $k$ new
uniform deviates $U_1$, ..., $U_k$ and set $V \leftarrow \min(U_1, \ldots, U_k)$.

S4. [Deliver the answer.] Set $X \leftarrow \mu(j + V)\ln 2$. ∎
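
A sketch of Algorithm S in Python follows. It departs from the bit-shifting description in one respect, stated here as an assumption: instead of shifting a single $(t + 1)$-bit word, it counts leading 1 bits one at a time and then draws a fresh 32-bit fraction, which has the same distribution as the shifted remainder when bits are unlimited. The names `T`, `Q`, and `exponential_s` are illustrative.

```python
import math
import random

T = 32
LN2 = math.log(2.0)

# Q[k] as in (33); Q[0] = 0 is a placeholder so that indices match the text.
Q = [0.0]
_term, _k = 1.0, 0
while Q[-1] <= 1.0 - 2.0 ** -T:
    _k += 1
    _term *= LN2 / _k
    Q.append(Q[-1] + _term)

def exponential_s(mu=1.0, rng=random):
    j = 0
    while rng.getrandbits(1) == 1:        # S1: j = number of leading 1 bits
        j += 1
    u = rng.getrandbits(T) / 2.0 ** T     # S1: the remaining fraction bits
    if u < LN2:                           # S2: immediate acceptance
        return mu * (j * LN2 + u)
    k = 2                                 # S3: least k >= 2 with u < Q[k]
    while u >= Q[k]:
        k += 1
    v = min(rng.random() for _ in range(k))
    return mu * (j + v) * LN2             # S4
```

The table stops as soon as $Q[k]$ exceeds $1 - 2^{-32}$, so the search in step S3 always terminates for a 32-bit fraction.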


Alternative ways to generate exponential deviates (for example, a ratio of 
uniforms as in Algorithm R) might also be used. 


E. Other continuous distributions. Let us now consider briefly how to 
handle some other distributions that arise reasonably often in practice. 
1) The gamma distribution of order $a > 0$ is defined by

$$F(x) = \frac{1}{\Gamma(a)} \int_0^x t^{a-1} e^{-t}\,dt, \qquad x \ge 0. \tag{34}$$

When $a = 1$, this is the exponential distribution with mean 1; when $a = \frac12$,
it is the distribution of $\frac12 Z^2$, where $Z$ has the normal distribution (mean 0,
variance 1). If $X$ and $Y$ are independent gamma-distributed random variables,
of order $a$ and $b$, respectively, then $X + Y$ has the gamma distribution of order
$a + b$. Thus, for example, the sum of $k$ independent exponential deviates with
mean 1 has the gamma distribution of order $k$. If the logarithm method (32)
is being used to generate these exponential deviates, we need compute only one
logarithm: $X \leftarrow -\ln(U_1 \ldots U_k)$, where $U_1$, ..., $U_k$ are nonzero uniform deviates.
This technique handles all integer orders $a$; to complete the picture, a suitable
method for $0 < a < 1$ appears in exercise 16.

The simple logarithm method is much too slow when $a$ is large, since it
requires $\lfloor a \rfloor$ uniform deviates. Moreover, there is a substantial risk that the
product $U_1 \ldots U_{\lfloor a \rfloor}$ will cause floating point underflow. For large $a$, the following
algorithm due to J. H. Ahrens is reasonably efficient, and it is easy to write in
terms of standard subroutines. [See Ann. Inst. Stat. Math. 13 (1962), 231-237.]


Algorithm A (Gamma distribution of order $a > 1$).

A1. [Generate candidate.] Set $Y \leftarrow \tan(\pi U)$, where $U$ is a uniform deviate, and
set $X \leftarrow \sqrt{2a - 1}\,Y + a - 1$. (In place of $\tan(\pi U)$ we could use a polar
method, calculating a ratio $V_2/V_1$ as in step P4 of Algorithm P.)

A2. [Accept?] If $X \le 0$, return to A1. Otherwise generate a uniform deviate $V$,
and return to A1 if $V > (1 + Y^2)\exp\bigl((a - 1)\ln(X/(a - 1)) - \sqrt{2a - 1}\,Y\bigr)$.
Otherwise accept $X$. ∎
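
Algorithm A needs only a tangent, a logarithm, and an exponential, so a direct transcription is short. The function name and the argument check are assumptions added for the example.

```python
import math
import random

def gamma_deviate_a(a, rng=random):
    """Gamma deviate of order a > 1, following steps A1 and A2."""
    if a <= 1.0:
        raise ValueError("this sketch requires a > 1")
    s = math.sqrt(2.0 * a - 1.0)
    while True:
        y = math.tan(math.pi * rng.random())          # A1
        x = s * y + a - 1.0
        if x <= 0.0:                                   # A2: candidate out of range
            continue
        v = rng.random()
        bound = (1.0 + y * y) * math.exp((a - 1.0) * math.log(x / (a - 1.0)) - s * y)
        if v <= bound:                                 # A2: accept
            return x
```

Since a gamma deviate of order $a$ has mean $a$ and variance $a$, those two sample statistics give a simple check.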


The average number of times step A1 is performed is $< 1.902$ when $a \ge 3$.

There is also an attractive approach for large $a$ based on the remarkable
fact that gamma deviates are approximately equal to $aX^3$, where $X$ is normally
distributed with mean $1 - 1/(9a)$ and standard deviation $1/\sqrt{9a}$; see E. B. Wilson
and M. M. Hilferty, Proc. Nat. Acad. Sci. 17 (1931), 684-688; G. Marsaglia,
Computers and Math. 3 (1977), 321-325.*

For a somewhat complicated but significantly faster algorithm, which gener- 
ates a gamma deviate in about twice the time to generate a normal deviate, see 
J. H. Ahrens and U. Dieter, CACM 25 (1982), 47-54. This article contains an 
instructive discussion of the design principles used to construct the algorithm. 


2) The beta distribution with positive parameters $a$ and $b$ is defined by

$$F(x) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)} \int_0^x t^{a-1}(1 - t)^{b-1}\,dt, \qquad 0 \le x \le 1. \tag{35}$$

Let $X_1$ and $X_2$ be independent gamma deviates of order $a$ and $b$, respectively,
and set $X \leftarrow X_1/(X_1 + X_2)$. Another method, useful for small $a$ and $b$, is to set

$$Y_1 \leftarrow U_1^{1/a} \quad\text{and}\quad Y_2 \leftarrow U_2^{1/b}$$

repeatedly until $Y_1 + Y_2 \le 1$; then $X \leftarrow Y_1/(Y_1 + Y_2)$. [See M. D. Jöhnk,
Metrika 8 (1964), 5-15.] Still another approach, if $a$ and $b$ are integers and not
too large, is to set $X$ to the $b$th largest of $a + b - 1$ independent uniform deviates
(see exercise 9 at the beginning of Chapter 5). See also the more direct method
described by R. C. H. Cheng, CACM 21 (1978), 317-322.

* Change “+(3a − 1)” to “−(3a − 1)” in Step 3 of the algorithm on page 323.
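
The second method for the beta distribution (Jöhnk's) is only a few lines. The sketch assumes uniform deviates from Python's `random` module; the function name is invented, and the guard against $Y_1 + Y_2 = 0$ is an addition for floating point safety.

```python
import random

def beta_johnk(a, b, rng=random):
    """Beta deviate with parameters a, b > 0; efficient only when a and b are small."""
    while True:
        y1 = rng.random() ** (1.0 / a)
        y2 = rng.random() ** (1.0 / b)
        if 0.0 < y1 + y2 <= 1.0:
            return y1 / (y1 + y2)
```

The mean of the results should be close to $a/(a + b)$; the rejection rate grows quickly as $a$ and $b$ increase, which is why the method is recommended only for small parameters.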


3) The chi-square distribution with $\nu$ degrees of freedom (Eq. 3.3.1-(22)) is
obtained by setting $X \leftarrow 2Y$, where $Y$ is a random variable having the gamma
distribution of order $\nu/2$.


4) The F-distribution (variance-ratio distribution) with $\nu_1$ and $\nu_2$ degrees of
freedom is defined by

$$F(x) = \frac{\nu_1^{\nu_1/2}\,\nu_2^{\nu_2/2}\,\Gamma((\nu_1 + \nu_2)/2)}{\Gamma(\nu_1/2)\,\Gamma(\nu_2/2)} \int_0^x t^{\nu_1/2 - 1}(\nu_2 + \nu_1 t)^{-\nu_1/2 - \nu_2/2}\,dt, \tag{36}$$

where $x \ge 0$. Let $Y_1$ and $Y_2$ be independent, having the chi-square distribution
with $\nu_1$ and $\nu_2$ degrees of freedom, respectively; set $X \leftarrow (Y_1\nu_2)/(Y_2\nu_1)$. Or set
$X \leftarrow \nu_2 Y/(\nu_1(1 - Y))$, where $Y$ is a beta variate with parameters $\nu_1/2$ and $\nu_2/2$.


5) The t-distribution with $\nu$ degrees of freedom is defined by

$$F(x) = \frac{\Gamma((\nu + 1)/2)}{\sqrt{\pi\nu}\,\Gamma(\nu/2)} \int_{-\infty}^{x} (1 + t^2/\nu)^{-(\nu+1)/2}\,dt. \tag{37}$$

Let $Y_1$ be a normal deviate (mean 0, variance 1) and let $Y_2$ be independent
of $Y_1$, having the chi-square distribution with $\nu$ degrees of freedom; set $X \leftarrow
Y_1/\sqrt{Y_2/\nu}$. Alternatively, when $\nu > 2$, let $Y_1$ be a normal deviate and let
$Y_2$ independently have the exponential distribution with mean $2/(\nu - 2)$; set
$Z \leftarrow Y_1^2/(\nu - 2)$ and reject $(Y_1, Y_2)$ if $e^{-Y_2 - Z} \ge 1 - Z$, otherwise set

$$X \leftarrow Y_1 \big/ \sqrt{(1 - 2/\nu)(1 - Z)}.$$


The latter method is due to George Marsaglia, Math. Comp. 34 (1980), 235-236. 
[See also A. J. Kinderman, J. F. Monahan, and J. G. Ramage, Math. Comp. 31 
(1977), 1009-1018.] 
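
The rejection method for $\nu > 2$ can be sketched as follows. The code assumes `random.gauss` and `random.expovariate` as the sources of normal and exponential deviates; the function name is illustrative.

```python
import math
import random

def student_t(nu, rng=random):
    """t deviate with nu > 2 degrees of freedom by the rejection method above."""
    if nu <= 2.0:
        raise ValueError("this sketch requires nu > 2")
    while True:
        y1 = rng.gauss(0.0, 1.0)
        y2 = rng.expovariate((nu - 2.0) / 2.0)     # exponential, mean 2/(nu - 2)
        z = y1 * y1 / (nu - 2.0)
        if math.exp(-y2 - z) < 1.0 - z:            # accept (implies z < 1)
            return y1 / math.sqrt((1.0 - 2.0 / nu) * (1.0 - z))
```

Given $Y_1$, the acceptance probability is $(1 - Z)^{(\nu-2)/2} e^{Y_1^2/2}$, which cancels the normal density and leaves a density proportional to $(1 + x^2/\nu)^{-(\nu+1)/2}$ after the change of variable. The sample variance should therefore be near $\nu/(\nu - 2)$.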


6) Random point on an $n$-dimensional sphere with radius one. Let $X_1$, $X_2$,
..., $X_n$ be independent normal deviates (mean 0, variance 1); the desired point
on the unit sphere is

$$(X_1/r,\ X_2/r,\ \ldots,\ X_n/r), \quad\text{where } r = \sqrt{X_1^2 + X_2^2 + \cdots + X_n^2}. \tag{38}$$

If the X’s are calculated using the polar method, Algorithm P, we compute two
independent X’s each time, and we have $X_1^2 + X_2^2 = -2\ln S$ in the notation
of that algorithm; this saves a little of the time needed to evaluate $r$. The
validity of (38) comes from the fact that the distribution function for the point
$(X_1, \ldots, X_n)$ has a density that depends only on its distance from the origin, so
when it is projected onto the unit sphere it has the uniform distribution. This
method was first suggested by G. W. Brown, in Modern Mathematics for the
Engineer, First series, edited by E. F. Beckenbach (New York: McGraw-Hill,
1956), 302. To get a random point inside the $n$-sphere, R. P. Brent suggests
taking a point on the surface and multiplying it by $U^{1/n}$.

In three dimensions a significantly simpler method can be used, since each
individual coordinate is uniformly distributed between $-1$ and $1$: Find $V_1$, $V_2$,
and $S$ by steps P1-P3 of Algorithm P; then the desired random point on the
surface of a globe is $(\alpha V_1, \alpha V_2, 2S - 1)$, where $\alpha = 2\sqrt{1 - S}$. [Robert E. Knop,
CACM 13 (1970), 326.]
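
The three constructions of this subsection can be sketched together. The function names are hypothetical, and `random.gauss` stands in for any generator of normal deviates; steps P1-P3 are written out as the usual rejection of points outside the unit disk.

```python
import math
import random

def point_on_sphere(n, rng=random):
    """Uniform point on the unit sphere in n dimensions, as in (38)."""
    while True:
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        r = math.sqrt(sum(x * x for x in xs))
        if r > 0.0:
            return [x / r for x in xs]

def point_in_ball(n, rng=random):
    """Uniform point inside the unit ball: surface point scaled by U^(1/n)."""
    scale = rng.random() ** (1.0 / n)
    return [scale * x for x in point_on_sphere(n, rng)]

def point_on_globe(rng=random):
    """Three-dimensional special case using V1, V2, S from steps P1-P3."""
    while True:
        v1 = 2.0 * rng.random() - 1.0
        v2 = 2.0 * rng.random() - 1.0
        s = v1 * v1 + v2 * v2
        if s < 1.0:
            alpha = 2.0 * math.sqrt(1.0 - s)
            return (alpha * v1, alpha * v2, 2.0 * s - 1.0)
```

For the globe, $\alpha^2 S + (2S - 1)^2 = 4S(1 - S) + (2S - 1)^2 = 1$, so the point does lie on the unit sphere.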


F. Important integer-valued distributions. A probability distribution that 
is nonzero only at integer values can essentially be handled by the techniques 
described at the beginning of this section; but some of these distributions are so 
important in practice, they deserve special mention here. 


1) The geometric distribution. If some event occurs with probability p, the 
number N of independent trials needed between occurrences of the event (or 
until the event occurs for the first time) has the geometric distribution. We 
have $N = 1$ with probability $p$, $N = 2$ with probability $(1 - p)p$, ..., $N = n$
with probability $(1 - p)^{n-1}p$. This is essentially the situation we have already
considered in the gap test of Section 3.3.2; it is also directly related to the number
of times certain loops in the algorithms of this section are executed, like steps
P1-P3 of the polar method.

A convenient way to generate a variable with this distribution is to set

$$N \leftarrow \lceil \ln U/\ln(1 - p) \rceil. \tag{39}$$

To check this formula, we observe that $\lceil \ln U/\ln(1 - p) \rceil = n$ if and only if
$n - 1 < \ln U/\ln(1 - p) \le n$, that is, $(1 - p)^{n-1} > U \ge (1 - p)^n$, and this happens
with probability $(1 - p)^{n-1}p$ as required. The quantity $\ln U$ can optionally be
replaced by $-Y$, where $Y$ has the exponential distribution with mean 1.

The special case $p = \frac12$ is quite simple on a binary computer, since formula (39)
reduces to setting $N \leftarrow \lceil -\lg U \rceil$; that is, $N$ is one more than the
number of leading zero bits in the binary representation of $U$.
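
Formula (39) becomes a one-line computation once the endpoint cases are handled. In this sketch the treatment of $U = 0$ (redraw) and $p = 1$ (return 1) and the use of `math.log1p` are choices made for the example.

```python
import math
import random

def geometric(p, rng=random):
    """Number of trials up to and including the first success, by formula (39)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must satisfy 0 < p <= 1")
    if p == 1.0:
        return 1
    u = rng.random()
    while u == 0.0:                  # ln 0 is undefined; redraw
        u = rng.random()
    return math.ceil(math.log(u) / math.log1p(-p))
```

The results are always at least 1, and their mean should be close to $1/p$.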


2) The binomial distribution $(t, p)$. If some event occurs with probability $p$, and
if we carry out $t$ independent trials, the total number $N$ of occurrences equals $n$
with probability $\binom{t}{n} p^n (1 - p)^{t-n}$. (See Section 1.2.10.) In other words if we
generate $U_1$, ..., $U_t$, we want to count how many of these are $< p$. For small $t$
we can obtain $N$ in exactly this way.

For large $t$, we can generate a beta variate $X$ with integer parameters $a$
and $b$ where $a + b - 1 = t$; this effectively gives us the $b$th largest of $t$ elements,
without bothering to generate the other elements. Now if $X \ge p$, we set $N \leftarrow N_1$
where $N_1$ has the binomial distribution $(a - 1, p/X)$, since this tells us how many
of $a - 1$ random numbers in the range $[0 \,..\, X)$ are $< p$; and if $X < p$, we set
$N \leftarrow a + N_1$ where $N_1$ has the binomial distribution $(b - 1, (p - X)/(1 - X))$,
since $N_1$ tells us how many of $b - 1$ random numbers in the range $[X \,..\, 1)$ are $< p$.
By choosing $a = 1 + \lfloor t/2 \rfloor$, the parameter $t$ will be reduced to a reasonable size
after about $\lg t$ reductions of this kind. (This approach is due to J. H. Ahrens,
who has also suggested an alternative for medium-sized $t$; see exercise 27.)
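
The halving scheme for large $t$ can be sketched with `random.betavariate` supplying the beta variate. The cutoff `small = 16`, below which the trials are simply counted, is an arbitrary assumption, as is the function name.

```python
import random

def binomial(t, p, rng=random, small=16):
    """Binomial (t, p) deviate via repeated beta-variate reductions."""
    n = 0
    while t > small:
        a = 1 + t // 2
        b = t + 1 - a                      # so that a + b - 1 = t
        x = rng.betavariate(a, b)          # the b-th largest of t uniform deviates
        if x >= p:
            t, p = a - 1, p / x            # only the a - 1 smaller values can be < p
        else:
            n += a                         # x and the a - 1 smaller values are all < p
            t, p = b - 1, (p - x) / (1.0 - x)
    return n + sum(rng.random() < p for _ in range(t))
```

Each pass roughly halves $t$, so the loop runs about $\lg t$ times before the direct count takes over.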

3) The Poisson distribution with mean $\mu$. The Poisson distribution is related
to the exponential distribution as the binomial distribution is related to the
geometric: It represents the number of occurrences, per unit time, of an event
that can occur at any instant of time. For example, the number of alpha particles
emitted by a radioactive substance in a single second has a Poisson distribution.

According to this principle, we can produce a Poisson deviate $N$ by generating
independent exponential deviates $X_1$, $X_2$, ... with mean $1/\mu$, stopping
as soon as $X_1 + \cdots + X_m \ge 1$; then $N \leftarrow m - 1$. The probability that
$X_1 + \cdots + X_m \ge 1$ is the probability that a gamma deviate of order $m$ is $\ge \mu$,
and this comes to $\int_\mu^\infty t^{m-1} e^{-t}\,dt/(m - 1)!$; hence the probability that $N = n$ is

$$\frac{1}{n!} \int_\mu^\infty t^n e^{-t}\,dt - \frac{1}{(n-1)!} \int_\mu^\infty t^{n-1} e^{-t}\,dt = e^{-\mu}\frac{\mu^n}{n!}, \qquad n \ge 0. \tag{40}$$

If we generate exponential deviates by the logarithm method, the recipe above
tells us to stop when $-(\ln U_1 + \cdots + \ln U_m)/\mu \ge 1$. Simplifying this expression,
we see that the desired Poisson deviate can be obtained by calculating $e^{-\mu}$,
converting it to a fixed point representation, then generating one or more uniform
deviates $U_1$, $U_2$, ... until the product satisfies $U_1 \ldots U_m \le e^{-\mu}$, finally setting
$N \leftarrow m - 1$. On the average this requires the generation of $\mu + 1$ uniform deviates,
so it is a very useful approach when $\mu$ is not too large.
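
The product method can be sketched in floating point as follows (the text suggests fixed point; that refinement is omitted here). The function name is an assumption, and the sketch is meant only for moderate $\mu$, since $e^{-\mu}$ underflows to zero in double precision when $\mu$ exceeds roughly 745.

```python
import math
import random

def poisson_product(mu, rng=random):
    """Poisson deviate with mean mu: multiply uniforms until the product <= e^{-mu}."""
    limit = math.exp(-mu)
    m = 0
    prod = 1.0
    while True:
        prod *= rng.random()
        m += 1
        if prod <= limit:
            return m - 1
```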

When $\mu$ is large, we can obtain a method of order $\log \mu$ by using the fact that
we know how to handle the gamma and binomial distributions for large orders:
First generate $X$ with the gamma distribution of order $m = \lfloor \alpha\mu \rfloor$, where $\alpha$ is a
suitable constant. (Since $X$ is equivalent to $-\ln(U_1 \ldots U_m)$, we are essentially
bypassing $m$ steps of the previous method.) If $X < \mu$, set $N \leftarrow m + N_1$, where
$N_1$ is a Poisson deviate with mean $\mu - X$; and if $X \ge \mu$, set $N \leftarrow N_1$, where
$N_1$ has the binomial distribution $(m - 1, \mu/X)$. This method is due to J. H.
Ahrens and U. Dieter, whose experiments suggest that $\frac78$ is a good choice for $\alpha$.
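
The reduction for large $\mu$ can be sketched as below. Several simplifications are assumptions of the example: `random.gammavariate` supplies the gamma deviate, the binomial case is handled by plain counting (so this sketch does not achieve the $\log\mu$ running time), and the cutoff `small = 12` for switching to the product method is arbitrary.

```python
import math
import random

def poisson_large(mu, rng=random, alpha=7.0 / 8.0, small=12.0):
    """Poisson deviate with mean mu via gamma and binomial reductions."""
    n = 0
    while mu >= small:
        m = int(alpha * mu)
        x = rng.gammavariate(m, 1.0)       # same law as -ln(U_1 ... U_m)
        if x < mu:
            n += m                         # m events have occurred; continue with mu - x
            mu -= x
        else:
            q = mu / x                     # binomial (m - 1, mu/x) by direct counting
            return n + sum(rng.random() < q for _ in range(m - 1))
    limit = math.exp(-mu)                  # finish with the product method
    prod = rng.random()
    while prod > limit:
        n += 1
        prod *= rng.random()
    return n
```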

The validity of the stated reduction when $X \ge \mu$ is a consequence of the
following important principle: “Let $X_1$, ..., $X_m$ be independent exponential
deviates with the same mean; let $S_j = X_1 + \cdots + X_j$ and let $V_j = S_j/S_m$
for $1 \le j \le m$. Then the distribution of $V_1$, $V_2$, ..., $V_{m-1}$ is the same as
the distribution of $m - 1$ independent uniform deviates sorted into increasing
order.” To establish this principle formally, we compute the probability that
$V_1 \le v_1$, ..., $V_{m-1} \le v_{m-1}$, given the value of $S_m = s$, for arbitrary values
$0 \le v_1 \le \cdots \le v_{m-1} \le 1$: Let $f(v_1, v_2, \ldots, v_{m-1})$ be the $(m - 1)$-fold integral

$$\int_0^{v_1 s} \mu e^{-t_1/\mu}\,dt_1 \int_0^{v_2 s - t_1} \mu e^{-t_2/\mu}\,dt_2 \;\ldots \int_0^{v_{m-1}s - t_1 - \cdots - t_{m-2}} \mu e^{-t_{m-1}/\mu}\,dt_{m-1} \cdot \mu e^{-(s - t_1 - \cdots - t_{m-1})/\mu};$$

then

$$\frac{f(v_1, v_2, \ldots, v_{m-1})}{f(1, 1, \ldots, 1)} = \frac{\int_0^{v_1} du_1 \int_{u_1}^{v_2} du_2 \;\ldots \int_{u_{m-2}}^{v_{m-1}} du_{m-1}}{\int_0^{1} du_1 \int_{u_1}^{1} du_2 \;\ldots \int_{u_{m-2}}^{1} du_{m-1}},$$

by making the substitution $t_1 = su_1$, $t_1 + t_2 = su_2$, ..., $t_1 + \cdots + t_{m-1} =
su_{m-1}$. The latter ratio is the corresponding probability that uniform deviates
$U_1$, ..., $U_{m-1}$ satisfy $U_1 \le v_1$, ..., $U_{m-1} \le v_{m-1}$, given that they also satisfy
$U_1 \le \cdots \le U_{m-1}$.

A more efficient but somewhat more complicated technique for binomial and 
Poisson deviates is sketched in exercise 22. 


G. For further reading. A facsimile of a letter from von Neumann dated May 
21, 1947, in which the rejection method first saw the light of day, appears in 
Stanislaw Ulam 1909-1984, a special issue of Los Alamos Science (Los Alamos 
National Lab., 1987), 135-136. The book Non-Uniform Random Variate Gen- 
eration by L. Devroye (Springer, 1986) discusses many more algorithms for the 
generation of random variables with nonuniform distributions, together with a 
careful consideration of the efficiency of each technique on typical computers. 

W. Hörmann and G. Derflinger [ACM Trans. Math. Software 19 (1993), 
489-495] have pointed out that it can be dangerous to use the rejection method 
in connection with linear congruential generators that have small multipliers
$a \approx \sqrt{m}$.

From a theoretical point of view it is interesting to consider optimal ways to 
generate random variables with a given distribution, in the sense that the method 
produces the desired result from the minimum possible number of random bits. 
For the beginnings of a theory dealing with such questions, see D. E. Knuth 
and A. C. Yao, Algorithms and Complexity, edited by J. F. Traub (New York: 
Academic Press, 1976), 357—428. 

Exercise 16 is recommended as a review of many of the techniques in this 
section. 


EXERCISES 


1. [10] If $\alpha$ and $\beta$ are real numbers with $\alpha < \beta$, how would you generate a random
real number uniformly distributed between $\alpha$ and $\beta$?


2. [M16] Assuming that $mU$ is a random integer between 0 and $m - 1$, what is
the exact probability that $\lfloor kU \rfloor = r$, if $0 \le r < k$? Compare this with the desired
probability $1/k$.

> 3. [14] Discuss treating U as an integer and computing its remainder mod k to get 
a random integer between 0 and k — 1, instead of multiplying as suggested in the text. 
Thus (1) would be changed to 


ENTA 0; LDX U; DIV K, 
with the result appearing in register X. Is this a good method? 
4. [M20] Prove the two relations in (8). 


> 5. [21] Suggest an efficient way to compute a random variable with the distribution
$F(x) = px + qx^2 + rx^3$, where $p \ge 0$, $q \ge 0$, $r \ge 0$, and $p + q + r = 1$.

6. [HM21] A quantity $X$ is computed by the following method:
Step 1. Generate two independent uniform deviates $U$ and $V$.
Step 2. If $U^2 + V^2 \ge 1$, return to step 1; otherwise set $X \leftarrow U$.
What is the distribution function of $X$? How many times will step 1 be performed?
(Give the mean and standard deviation.)

7. [20] (A. J. Walker.) Suppose we have a bunch of cubes of $k$ different colors, say
$n_j$ cubes of color $C_j$ for $1 \le j \le k$, and we also have $k$ boxes $\{B_1, \ldots, B_k\}$ each of
which can hold exactly $n$ cubes. Furthermore $n_1 + \cdots + n_k = kn$, so the cubes will
just fit in the boxes. Prove (constructively) that there is always a way to put the cubes
into the boxes so that each box contains at most two different colors of cubes; in fact,
there is a way to do it so that, whenever box $B_j$ contains two colors, one of those colors
is $C_j$. Show how to use this principle to compute the P and Y tables required in (3),
given a probability distribution $(p_1, \ldots, p_k)$.


8. [M15] Show that operation (3) could be changed to

if $U < P_K$ then $X \leftarrow x_{K+1}$ otherwise $X \leftarrow Y_K$

(thus using the original value of $U$ instead of $V$) if this were more convenient, by
suitably modifying $P_0$, $P_1$, ..., $P_{k-1}$.

9. [HM10] Why is the curve $f(x)$ of Fig. 9 concave for $x < 1$, convex for $x > 1$?

10. [HM24] Explain how to calculate auxiliary constants $P_j$, $Q_j$, $Y_j$, $Z_j$, $S_j$, $D_j$, $E_j$
so that Algorithm M delivers answers with the correct distribution.


11. [HM27] Prove that steps M7-M8 of Algorithm M generate a random variable
with the appropriate tail of the normal distribution; in other words, the probability
that $X \le x$ should be exactly

$$\int_3^x e^{-t^2/2}\,dt \bigg/ \int_3^\infty e^{-t^2/2}\,dt, \qquad x \ge 3.$$

[Hint: Show that it is a special case of the rejection method, with $g(t) = Cte^{-t^2/2}$ for
some $C$.]


12. [HM23] (R. P. Brent.) Prove that the numbers $a_j$ defined in (23) satisfy the
relation

$$a_j^2 - a_{j-1}^2 < 2\ln 2 \qquad\text{for all } j \ge 1.$$

[Hint: If $f(x) = e^{x^2/2}\int_x^\infty e^{-t^2/2}\,dt$, show that $f(x) > f(y)$ for $0 \le x < y$.]

13. [HM25] Given a set of $n$ independent normal deviates, $X_1$, $X_2$, ..., $X_n$, with
mean 0 and variance 1, show how to find constants $b_j$ and $a_{ij}$, $1 \le j \le i \le n$, so that if

$$Y_1 = b_1 + a_{11}X_1, \quad Y_2 = b_2 + a_{21}X_1 + a_{22}X_2, \quad \ldots, \quad Y_n = b_n + a_{n1}X_1 + \cdots + a_{nn}X_n,$$

then $Y_1$, $Y_2$, ..., $Y_n$ are dependent normally distributed variables, $Y_j$ has mean $\mu_j$,
and the Y’s have a given covariance matrix $(c_{ij})$. (The covariance, $c_{ij}$, of $Y_i$ and $Y_j$ is
defined to be the average value of $(Y_i - \mu_i)(Y_j - \mu_j)$. In particular, $c_{jj}$ is the variance
of $Y_j$, the square of its standard deviation. Not all matrices $(c_{ij})$ can be covariance
matrices, and your construction is, of course, only supposed to work whenever a solution
to the given conditions is possible.)

14. [M21] If X is a random variable with the continuous distribution F(x), and if c 
is a (possibly negative) constant, what is the distribution of cX? 

15. [HM21] If $X_1$ and $X_2$ are independent random variables with the respective
distributions $F_1(x)$ and $F_2(x)$, and with densities $f_1(x) = F_1'(x)$, $f_2(x) = F_2'(x)$, what
are the distribution and density functions of the quantity $X_1 + X_2$?

> 16. [HM22] (J. H. Ahrens.) Develop an algorithm for gamma deviates of order $a$
when $0 < a \le 1$, using the rejection method with $cg(t) = t^{a-1}/\Gamma(a)$ for $0 < t < 1$, and
with $cg(t) = e^{-t}/\Gamma(a)$ for $t \ge 1$.

> 17. [M24] What is the distribution function F(x) for the geometric distribution with 
probability p? What is the generating function G(z)? What are the mean and standard 
deviation of this distribution? 


18. [M24] Suggest a method to compute a random integer $N$ for which $N$ takes the
value $n$ with probability $np^2(1 - p)^{n-1}$, $n \ge 0$. (The case of particular interest is when
$p$ is rather small.)


19. [22] The negative binomial distribution $(t, p)$ has integer values $N = n$ with
probability $\binom{t-1+n}{n} p^t (1 - p)^n$. (Unlike the ordinary binomial distribution, $t$ need not
be an integer, since this quantity is nonnegative for all $n$ whenever $t > 0$.) Generalizing
exercise 18, explain how to generate integers $N$ with this distribution when $t$ is a small
positive integer. What method would you suggest if $t = p = \frac12$?


20. [M20] Let $A$ be the area of the shaded region in Fig. 13, and let $R$ be the area of
the enclosing rectangle. Let $I$ be the area of the interior region recognized by step R2,
and let $E$ be the area between the exterior region rejected in step R3 and the outer
rectangle. Determine the number of times each step of Algorithm R is performed, for
each of its four variants as in (25), in terms of $A$, $R$, $I$, and $E$.


21. [HM29] Derive formulas for the quantities $A$, $R$, $I$, and $E$ defined in exercise 20.
(For $I$ and especially $E$ you may wish to use an interactive computer algebra system.)
Show that $c = e^{1/4}$ is the best possible constant in step R2 for tests of the form
“$X^2 \le 4(1 + \ln c) - 4cU$.”

22. [HM40] Can the exact Poisson distribution for large $\mu$ be obtained by generating
an appropriate normal deviate, converting it to an integer in some convenient way, and
applying a (possibly complicated) correction a small percent of the time?


23. [HM23] (J. von Neumann.) Are the following two ways to generate a random
quantity $X$ equivalent (that is, does the quantity $X$ have the same distribution)?

Method 1: Set $X \leftarrow \sin((\pi/2)U)$, where $U$ is uniform.

Method 2: Generate two uniform deviates, $U$ and $V$; if $U^2 + V^2 \ge 1$, repeat
until $U^2 + V^2 < 1$. Then set $X \leftarrow |U^2 - V^2|/(U^2 + V^2)$.


24. [HM40] (S. Ulam, J. von Neumann.) Let $V_0$ be a randomly selected real number
between 0 and 1, and define the sequence $\langle V_n \rangle$ by the rule $V_{n+1} = 4V_n(1 - V_n)$. If this
computation is done with perfect accuracy, the result should be a sequence with the
distribution $\sin^2 \pi U$, where $U$ is uniform, that is, with distribution function $F(x) =
\int_0^x dx/\sqrt{2\pi x(1 - x)}$. For if we write $V_n = \sin^2 \pi U_n$, we find that $U_{n+1} = (2U_n) \bmod 1$;
and by the fact that almost all real numbers have a random binary expansion (see
Section 3.5), this sequence $U_n$ is equidistributed. But if the computation of $V_n$ is done
with only finite accuracy, the argument breaks down because we soon are dealing with
noise from the roundoff error. [See von Neumann’s Collected Works 5, 768-770.]

Analyze the sequence $\langle V_n \rangle$ defined in the preceding paragraph, when only finite
accuracy is present, both empirically (for various different choices of $V_0$) and theoretically.
Does the sequence have a distribution resembling the expected distribution?


25. [M25] Let $X_1$, $X_2$, ..., $X_5$ be binary words each of whose bits is independently
0 or 1 with probability $\frac12$. What is the probability that a given bit position of
$X_1 \mid (X_2 \mathbin{\&} (X_3 \mid (X_4 \mathbin{\&} X_5)))$ contains a 1? Generalize.

26. [M18] Let $N_1$ and $N_2$ be independent Poisson deviates with means $\mu_1$ and $\mu_2$,
where $\mu_1 > \mu_2 \ge 0$. Prove or disprove: (a) $N_1 + N_2$ has the Poisson distribution with
mean $\mu_1 + \mu_2$. (b) $N_1 - N_2$ has the Poisson distribution with mean $\mu_1 - \mu_2$.


27. [22] (J. H. Ahrens.) On most binary computers there is an efficient way to count
the number of 1s in a binary word (see Section 7.1.3). Hence there is a nice way to
obtain the binomial distribution $(t, p)$ when $p = \frac12$, simply by generating $t$ random bits
and counting the number of 1s.

Design an algorithm that produces the binomial distribution $(t, p)$ for arbitrary $p$,
using only a subroutine for the special case $p = \frac12$ as a source of random data. [Hint:
Simulate a process that first looks at the most significant bits of $t$ uniform deviates,
then at the second bit of those deviates whose leading bit is not sufficient to determine
whether or not their value is $< p$, etc.]


28. [HM35] (R. P. Brent.) Develop a method to generate a random point on the
surface of the ellipsoid defined by $\sum a_k x_k^2 = 1$, where $a_1 \ge \cdots \ge a_n > 0$.


29. [M20] (J. L. Bentley and J. B. Saxe.) Find a simple way to generate $n$ numbers
$X_1$, ..., $X_n$ that are uniform between 0 and 1 except for the fact that they are sorted:
$X_1 \le \cdots \le X_n$. Your algorithm should take only $O(n)$ steps.


30. [M30] Explain how to generate a set of random points $(X_j, Y_j)$ such that, if $R$ is
any rectangle of area $\alpha$ contained in the unit square, the number of $(X_j, Y_j)$ lying in $R$
has the Poisson distribution with mean $\alpha\mu$.


31. [HM39] (Direct generation of normal deviates.)

a) Prove that if $a_1^2 + \cdots + a_k^2 = 1$ and if $X_1$, ..., $X_k$ are independent normal deviates
with mean 0 and variance 1, then $a_1X_1 + \cdots + a_kX_k$ is a normal deviate with
mean 0 and variance 1.

b) The result of (a) suggests that we can generate new normal deviates from old ones,
just as we obtain new uniform deviates from old ones. For example, we might use
the idea of 3.2.2-(7), but with a recurrence like

$$X_n = (X_{n-24} + X_{n-55})/\sqrt2 \qquad\text{or}\qquad X_n = \tfrac35 X_{n-24} + \tfrac45 X_{n-55},$$

after a set of normal deviates $X_0$, ..., $X_{54}$ has been computed initially. Explain
why this is not a good idea.

c) Show, however, that there is a suitable way to generate normal deviates quickly
from other normal deviates, by using a refinement of the idea in (a) and (b). [Hint:
If $X$ and $Y$ are independent normal deviates, so are $X' = X\cos\theta + Y\sin\theta$ and
$Y' = -X\sin\theta + Y\cos\theta$, for any angle $\theta$.]


32. [HM30] (C. S. Wallace.) Let X and Y be independent exponential deviates with 
mean 1. Show that X’ and Y’ are, likewise, independent exponential deviates with 
mean 1, if we obtain them from X and Y in any of the following ways: 

a) Given 0 < λ < 1, 

\[ X' = (1-\lambda)X - \lambda Y + (X+Y)\,[(1-\lambda)X < \lambda Y], \qquad Y' = X + Y - X'. \]

b) \[ (X', Y') = \begin{cases} (2X,\; Y-X), & X \le Y; \\ (2Y,\; X-Y), & X > Y. \end{cases} \]

c) If X = (... x_2 x_1 x_0 . x_{−1} x_{−2} x_{−3} ...)_2 and Y = (... y_2 y_1 y_0 . y_{−1} y_{−2} y_{−3} ...)_2 in binary 
notation, then X’ and Y’ have the “shuffled” values 

\[ X' = (\ldots x_2\, y_1\, x_0 \,.\, y_{-1}\, x_{-2}\, y_{-3} \ldots)_2, \qquad Y' = (\ldots y_2\, x_1\, y_0 \,.\, x_{-1}\, y_{-2}\, x_{-3} \ldots)_2. \]


33. [20] Algorithms P, M, F, and R generate normal deviates by consuming an 
unknown number of uniform random variables U1, U2, .... How can they be modified 
so that the output is a function of just one U? 


3.4.2. Random Sampling and Shuffling 


Many data processing applications call for an unbiased choice of n records at 
random from a file containing N records. This problem arises, for example, in 
quality control or other statistical calculations where sampling is needed. Usually 
N is very large, so that it is impossible to contain all the data in memory at once; 
and the individual records themselves are often very large, so that we can’t even 
hold n records in memory. Therefore we seek an efficient procedure for selecting 
n records by deciding either to accept or to reject each record as it comes along, 
writing the accepted records onto an output file. 

Several methods have been devised for this problem. The most obvious 
approach is to select each record with probability n/N; this may sometimes 
be appropriate, but it gives only an average of n records in the sample. The 
standard deviation is √(n(1 − n/N)), and the sample might turn out to be either 
too large for the desired application or too small to give the necessary results. 

Fortunately, a simple modification of the “obvious” procedure gives us what 
we want: The (t+1)st record should be selected with probability (n−m)/(N−t), 
if m items have already been selected. This is the appropriate probability, since 
of all the possible ways to choose n things from N such that m values occur in 
the first t, exactly 

\[ \binom{N-t-1}{n-m-1} \bigg/ \binom{N-t}{n-m} = \frac{n-m}{N-t} \qquad (1) \]

of them select the (t + 1)st element. 
The idea developed in the preceding paragraph leads immediately to the 
following algorithm: 


Algorithm S (Selection sampling technique). To select n records at random 
from a set of N, where 0 < n ≤ N. 

S1. [Initialize.] Set t ← 0, m ← 0. (During this algorithm, m represents the 
number of records selected so far, and t is the total number of input records 
that we have dealt with.) 

S2. [Generate U.] Generate a random number U, uniformly distributed between 
zero and one. 

S3. [Test.] If (N − t)U ≥ n − m, go to step S5. 

S4. [Select.] Select the next record for the sample, and increase m and t by 1. 
If m < n, go to step S2; otherwise the sample is complete and the algorithm 
terminates. 

S5. [Skip.] Skip the next record (do not include it in the sample), increase t 
by 1, and go back to step S2. ▮ 
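
Steps S1 through S5 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name selection_sample is hypothetical, the input is taken to be an in-memory list (so that N is known from its length) rather than a sequential file, and Python's random module stands in for the source of U. 

```python
import random

def selection_sample(records, n, rng=random):
    """Sketch of Algorithm S: choose exactly n of the N records, in input order.

    Assumption: `records` is a list, so N = len(records) is known in advance.
    """
    N = len(records)
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    sample = []
    t = 0  # S1: number of input records dealt with so far
    m = 0  # S1: number of records selected so far
    while m < n:
        u = rng.random()             # S2: U is uniform in [0, 1)
        if (N - t) * u < n - m:      # S3 fails to skip, so S4 selects
            sample.append(records[t])
            m += 1
        t += 1                       # both S4 and S5 advance to the next record
    return sample
```

Because U < 1, the test always selects once N − t = n − m, so the loop can never index past the last record. 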

This algorithm may appear to be unreliable at first glance and, in fact, to 
be incorrect; but a careful analysis (see the exercises below) shows that it is 
completely trustworthy. It is not difficult to verify that 


a) At most N records are input (we never run off the end of the file before 
choosing n items). 


b) The sample is completely unbiased. In particular, the probability that any 
given element is selected, such as the last element of the file, is n/N. 


Statement (b) is true in spite of the fact that we are not selecting the (t+ 1)st 
item with probability n/N, but rather with the probability in Eq. (1)! This has 
caused some confusion in the published literature. Can the reader explain this 
seeming contradiction? 

(Note: When using Algorithm S, one should be careful to use a different 
source of random numbers U each time the program is run, to avoid connections 
between the samples obtained on different days. This can be done, for example, 
by choosing a different value of Xo for the linear congruential method each time. 
The seed value Xo could be set to the current date, or to the last random 
number X that was generated on the previous run of the program.) 

We will usually not have to pass over all N records. In fact, since (b) above 
says that the last record is selected with probability n/N, we will terminate the 
algorithm before considering the last record exactly (1 — n/N) of the time. The 
average number of records considered when n = 2 is about (2/3)N, and the general 
formulas are given in exercises 5 and 6. 

Algorithm S and a number of other sampling techniques are discussed in a 
paper by C. T. Fan, Mervin E. Muller, and Ivan Rezucha, J. Amer. Stat. Assoc. 
57 (1962), 387-402. The method was independently discovered by T. G. Jones, 
CACM 5 (1962), 343. 


A problem arises if we don’t know the value of N in advance, since the 
precise value of N is crucial in Algorithm S. Suppose we want to select n items 
at random from a file, without knowing exactly how many are present in that 
file. We could first go through and count the records, then take a second pass 
to select them; but it is generally better to sample m ≥ n of the original items 
on the first pass, where m is much less than N, so that only m items must be 
considered on the second pass. The trick, of course, is to do this in such a way 
that the final result is a truly random sample of the original file. 

Since we don’t know when the input is going to end, we must keep track of 
a random sample of the input records seen so far, thus always being prepared for 
the end. As we read the input we will construct a “reservoir” that contains only 
the records that have appeared among the previous samples. The first n records 
always go into the reservoir. When the (t+1)st record is being input, for t ≥ n, 
we will have in memory a table of n indices pointing to the records that we have 
chosen from among the first t. The problem is to maintain this situation with 
t increased by one, namely to find a new random sample from among the t + 1 
records now known to be present. It is not hard to see that we should include 
the new record in the new sample with probability n/(t+1), and in such a case 
it should replace a random element of the previous sample. 
Thus, the following procedure does the job: 


Algorithm R (Reservoir sampling). To select n records at random from a file of 
unknown size ≥ n, given n > 0. An auxiliary file called the “reservoir” contains 
all records that are candidates for the final sample. The algorithm uses a table 
of distinct indices I[j] for 1 ≤ j ≤ n, each of which points to one of the records 
in the reservoir. 

R1. [Initialize.] Input the first n records and copy them to the reservoir. Set 
I[j] ← j for 1 ≤ j ≤ n, and set t ← m ← n. (If the file being sampled has 
fewer than n records, it will of course be necessary to abort the algorithm 
and report failure. During this algorithm, indices I[1], ..., I[n] point to the 
records in the current sample; m is the size of the reservoir; and t is the 
number of input records dealt with so far.) 

R2. [End of file?] If there are no more records to be input, go to step R6. 

R3. [Generate and test.] Increase t by 1, then generate a random integer M 
between 1 and t (inclusive). If M > n, go to R5. 

R4. [Add to reservoir.] Copy the next record of the input file to the reservoir, 
increase m by 1, and set I[M] ← m. (The record previously pointed to by 
I[M] is being replaced in the sample by the new record.) Go back to R2. 

R5. [Skip.] Skip over the next record of the input file (do not include it in the 
reservoir), and return to step R2. 

R6. [Second pass.] Sort the I table entries so that I[1] < ... < I[n]; then go 
through the reservoir, copying the records with these indices into the output 
file that is to hold the final sample. ▮ 
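
A minimal Python sketch of steps R1 through R6 follows. It rests on several assumptions: the name reservoir_sample is hypothetical, the "reservoir file" is modeled by an ordinary list, the index table is stored 0-based, and the function returns the reservoir size m alongside the sample so that its growth can be observed. 

```python
import random

def reservoir_sample(stream, n, rng=random):
    """Sketch of Algorithm R for an iterable whose length is not known.

    Returns (sample, m), where m is the final size of the reservoir.
    """
    it = iter(stream)
    reservoir = []                      # stands in for the auxiliary file
    for _ in range(n):                  # R1: the first n records always go in
        try:
            reservoir.append(next(it))
        except StopIteration:
            raise ValueError("input has fewer than n records") from None
    index = list(range(n))              # the table I[1..n], kept 0-based here
    t = n
    for record in it:                   # R2: continue until end of input
        t += 1
        M = rng.randint(1, t)           # R3: random integer, 1 <= M <= t
        if M <= n:                      # R4: new record takes over slot M
            reservoir.append(record)
            index[M - 1] = len(reservoir) - 1
        # R5: otherwise the record is skipped
    index.sort()                        # R6: second pass, in reservoir order
    return [reservoir[i] for i in index], len(reservoir)
```

For example, reservoir_sample(range(1000), 5) returns five distinct values in increasing order together with the number of records that were copied to the reservoir. 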


Algorithm R is due to Alan G. Waterman. The reader may wish to work 
out the example of its operation that appears in exercise 9. 

If the records are sufficiently short, it is of course unnecessary to have a 
reservoir at all; we can keep the n records of the current sample in memory at 
all times, and the algorithm becomes much simpler (see exercise 10). 

The natural question to ask about Algorithm R is, “What is the expected 
size of the reservoir?” Exercise 11 shows that the average value of m is exactly 
n(1 + H_N − H_n); this is approximately n(1 + ln(N/n)). So if N/n = 1000, the 
reservoir will contain only about 1/125 as many items as the original file. 
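
The arithmetic behind the 1/125 figure can be checked numerically. The sketch below evaluates n(1 + H_N − H_n) directly and compares it with the logarithmic approximation; the function name and the particular values n = 10, N = 10000 are chosen here only for illustration. 

```python
from math import log

def expected_reservoir_size(n, N):
    """n(1 + H_N - H_n), using H_N - H_n = 1/(n+1) + ... + 1/N."""
    return n * (1 + sum(1.0 / i for i in range(n + 1, N + 1)))

n, N = 10, 10_000                       # illustrative values with N/n = 1000
exact = expected_reservoir_size(n, N)
approx = n * (1 + log(N / n))
print(exact, approx, N / exact)         # N/exact is the shrink factor
```

Both values come out a little under 80 records, so the reservoir holds roughly 1/125 of the 10000 input records, in agreement with the estimate above. 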

Notice that Algorithms S and R can be used to obtain samples for several 
independent categories simultaneously. For example, if we have a large file of 
names and addresses of U.S. residents, we could pick random samples of exactly 
10 people from each of the 50 states without making 50 passes through the file, 
and without first sorting the file by state. 

Significant improvements to both Algorithms S and R are possible, when 
n/N is small, if we generate a single random variable to tell us how many records 
should be skipped instead of deciding whether or not to skip each record. (See 
exercise 8.) 

The sampling problem can be regarded as the computation of a random 
combination, according to the conventional definition of combinations of N things 
taken n at a time (see Section 1.2.6). Now let us consider the problem of 
computing a random permutation of t objects; we will call this the shuffling 
problem, since shuffling a deck of cards is nothing more than subjecting the deck 
to a random permutation. 

A moment’s reflection is enough to convince any card player that traditional 
shuffling procedures are miserably inadequate. There is no hope of obtaining each 
of the t! permutations with anywhere near equal probability by such methods. 
Expert bridge players reportedly make use of this fact when deciding whether 
or not to finesse. At least seven “riffle shuffles” of a 52-card deck are needed to 
reach a distribution within 10% of uniform, and 14 random riffles are guaranteed 
to do so [see Aldous and Diaconis, AMM 93 (1986), 333-348]. 

If t is small, we can obtain a random permutation very quickly by generating 
a random integer between 1 and t!. For example, when t = 4, a random number 
between 1 and 24 suffices to select a random permutation from a table of all 
possibilities. But for large t, it is necessary to be more careful if we want to 
claim that each permutation is equally likely, since t! is much larger than the 
accuracy of individual random numbers. 

A suitable shuffling procedure can be obtained by recalling Algorithm 3.3.2P, 
which gives a simple correspondence between each of the t! possible permutations 
and a sequence of numbers (c_1, c_2, ..., c_{t−1}), with 0 ≤ c_j ≤ j. It is easy to 
compute such a set of numbers at random, and we can use the correspondence 
to produce a random permutation. 


Algorithm P (Shuffling). Let (X_1, X_2, ..., X_t) be a sequence of t numbers to 
be shuffled. 

P1. [Initialize.] Set j ← t. 

P2. [Generate U.] Generate a random number U, uniformly distributed between 
zero and one. 

P3. [Exchange.] Set k ← ⌊jU⌋ + 1. (Now k is a random integer, between 1 
and j. Exercise 3.4.1-3 explains that k should not be computed by taking 
a remainder modulo j.) Exchange X_k ↔ X_j. 

P4. [Decrease j.] Decrease j by 1. If j > 1, return to step P2. ▮ 
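
Algorithm P translates almost line for line into Python. The sketch below is an illustration, not a definitive implementation: the name shuffle_in_place is hypothetical, the list is 0-based (so X_k is x[k − 1]), and random.random() is assumed as the source of U. 

```python
import random

def shuffle_in_place(x, rng=random):
    """Sketch of Algorithm P: shuffle the list x in place and return it."""
    j = len(x)                        # P1: j <- t
    while j > 1:
        u = rng.random()              # P2: U is uniform in [0, 1)
        k = int(j * u) + 1            # P3: k = floor(jU) + 1, so 1 <= k <= j
        x[k - 1], x[j - 1] = x[j - 1], x[k - 1]   # exchange X_k and X_j
        j -= 1                        # P4: decrease j
    return x
```

The loop stops as soon as j reaches 1, since the exchange for j = 1 could only swap X_1 with itself. 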


This algorithm was first published by R. A. Fisher and F. Yates [Statistical 
Tables (London, 1938), Example 12], in ordinary language, and by R. Durstenfeld 
[CACM 7 (1964), 420] in computer language. If we merely wish to generate a random 
permutation of {1, ..., t} instead of shuffling a given sequence (X_1, ..., X_t), 
we can avoid the exchange operation X_k ↔ X_j by letting j increase from 1 to t 
and setting X_j ← X_k, X_k ← j; see D. E. Knuth, The Stanford GraphBase (New 
York: ACM Press, 1994), 104. 
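
The exchange-free variant just described can be sketched as follows, assuming that k is drawn as a random integer between 1 and j exactly as in step P3. The function name random_permutation is hypothetical, and a padding cell x[0] is used so that the indices match the 1-based notation of the text. 

```python
import random

def random_permutation(t, rng=random):
    """Build a random permutation of 1..t without any exchange operation."""
    x = [0] * (t + 1)                    # x[1..t]; x[0] is unused padding
    for j in range(1, t + 1):            # j increases from 1 to t
        k = int(j * rng.random()) + 1    # random integer with 1 <= k <= j
        x[j] = x[k]                      # X_j <- X_k
        x[k] = j                         # X_k <- j
    return x[1:]
```

When k = j the first assignment copies a cell onto itself and the second simply stores j there, so no special case is needed. 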

R. Salfi [COMPSTAT 1974 (Vienna: 1974), 28-35] has pointed out that 
Algorithm P cannot possibly generate more than m distinct permutations when 
we obtain the uniform U’s with a linear congruential sequence of modulus m, 
or indeed whenever we use a recurrence U_{n+1} = f(U_n) for which U_n can take 
only m different values, because the final permutation in such cases is entirely 
determined by the value of the first U that is generated. Thus, for example, 
if m = 2^32, certain permutations of 13 elements will never occur, since 13! ≈ 
1.45 × 2^32. In most applications we don’t really want to see all 13! permutations; 
yet it is disconcerting to know that the excluded ones are determined by a fairly 
simple mathematical rule such as a lattice structure (see Section 3.3.4). 
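
The counting argument is easy to reproduce. The following sketch computes the ratio 13!/2^32 and finds the smallest t for which t! exceeds a given number m of generator states; the helper name smallest_t_exceeding is hypothetical, and the value 2^64 is an extra illustration not taken from the text. 

```python
from math import factorial

def smallest_t_exceeding(m):
    """Smallest t with t! > m: some permutations of t items must be missed."""
    t = 1
    while factorial(t) <= m:
        t += 1
    return t

ratio = factorial(13) / 2**32
print(round(ratio, 2))                  # 1.45
print(smallest_t_exceeding(2**32))      # 13
print(smallest_t_exceeding(2**64))      # 21
```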

This problem does not arise when we use a lagged Fibonacci generator like 
3.2.2-(7) with a sufficiently long period. But even with such methods we cannot 
get all permutations uniformly unless we are able to specify at least t! different 
seed values to initialize the generator. In other words, we can’t get lg t! truly 
random bits out unless we put lg t! truly random bits in. Section 3.5 shows that 
we need not despair about this. 

Algorithm P can easily be modified to yield a random permutation of a 
random combination (see exercise 15). For a discussion of random combinatorial 
objects of other kinds (e.g., partitions), see Section 7.2 and/or the book Combi- 
natorial Algorithms by Nijenhuis and Wilf (New York: Academic Press, 1975). 


EXERCISES 

1. [M12] Explain Eq. (1). 

2. [20] Prove that Algorithm S never tries to read more than N records of its 
input file. 


3. [22] The (t+1)st item in Algorithm S is selected with probability (n—m)/(N—t), 
not n/N, yet the text claims that the sample is unbiased; thus each item should be 
selected with the same probability. How can both of these statements be true? 


4. [M23] Let p(m,t) be the probability that exactly m items are selected from among 
the first t in the selection sampling technique. Show directly from Algorithm S that 

\[ p(m,t) = \binom{t}{m} \binom{N-t}{n-m} \bigg/ \binom{N}{n}, \qquad \text{for } 0 \le t \le N. \]


5. [M24] What is the average value of t when Algorithm S terminates? (In other 
words, how many of the N records have been passed, on the average, before the sample 
is complete?) 

6. [M24] What is the standard deviation of the value computed in exercise 5? 


7. [M25] Prove that any given choice of n records from the set of N is obtained by 
Algorithm S with probability $1/\binom{N}{n}$. Therefore the sample is completely unbiased. 


8. [M39] (J. S. Vitter.) Algorithm S computes one uniform deviate for each input 
record it handles. The purpose of this exercise is to consider a more efficient approach 
in which we calculate more quickly the proper number X of input records to skip before 
the first selection is made. 


a) What is the probability that X > k, given k? 

b) Show that the result of (a) allows us to calculate X by generating only one 
uniform U and then doing O(X) other calculations. 

c) Show that we may also set X ← min(Y_N, Y_{N−1}, ..., Y_{N−n+1}), where the Y’s are 
independent and each Y_t is a random integer in the range 0 ≤ Y_t < t. 

d) For maximum speed, show that X can also be calculated in O(1) steps, on the 
average, using a “squeeze method” like Eq. 3.4.1-(18). 


9. [12] Let n = 3. If Algorithm R is applied to a file containing 20 records numbered 
1 thru 20, and if the random numbers generated in step R3 are respectively 


4, 1, 6, 7, 5, 3, 5, 11, 11, 3, 7, 9, 3, 11, 4, 5, 4, 


which records go into the reservoir? Which are in the final sample? 


10. [15] Modify Algorithm R so that the reservoir is eliminated, assuming that the n 
records of the current sample can be held in memory. 


11. [M25] Let p_m be the probability that exactly m elements are put into the reservoir 
during the first pass of Algorithm R. Determine the generating function G(z) = 
∑_m p_m z^m, and find the mean and standard deviation. (Use the ideas of Section 1.2.10.) 


12. [M26] The gist of Algorithm P is that any permutation π can be uniquely written 
as a product of transpositions in the form π = (a_t t) ... (a_3 3)(a_2 2), where 1 ≤ a_j ≤ j 
for t ≥ j > 1. Prove that there is also a unique representation of the form π = 
(b_2 2)(b_3 3) ... (b_t t), where 1 ≤ b_j ≤ j for 1 < j ≤ t, and design an algorithm that 
computes the b’s from the a’s in O(t) steps. 

13. [M23] (S. W. Golomb.) One of the most common ways to shuffle cards is to divide 
the deck into two parts as equal as possible, and to “riffle” them together. (According 
to the discussion of card-playing etiquette in Hoyle’s rules of card games, “A shuffle of 
this sort should be made about three times to mix the cards thoroughly.”) Consider 
a deck of 2n − 1 cards X_1, X_2, ..., X_{2n−1}; a “perfect shuffle” s divides this deck into 
X_1, X_2, ..., X_n and X_{n+1}, ..., X_{2n−1}, then perfectly interleaves them to obtain X_1, 
X_{n+1}, X_2, X_{n+2}, ..., X_{2n−1}, X_n. The “cut” operation c^j changes X_1, X_2, ..., X_{2n−1} 
into X_{j+1}, ..., X_{2n−1}, X_1, ..., X_j. Show that by combining perfect shuffles and cuts, 
at most (2n − 1)(2n − 2) different arrangements of the deck are possible, if n > 1. 

14. [22] A cut-and-riffle permutation of a_0 a_1 ... a_{n−1} changes it to a sequence that 
contains the subsequences 

\[ a_x\, a_{(x+1) \bmod n} \ldots a_{(y-1) \bmod n} \quad\text{and}\quad a_y\, a_{(y+1) \bmod n} \ldots a_{(x-1) \bmod n} \]

intermixed in some way, for some x and y. Thus, 3890145267 is a cut-and-riffle of 
0123456789, with x = 3 and y = 8. 

a) Beginning with 52 playing cards arranged in the standard order 

[Card sequence: ranks 2 3 4 5 6 7 8 9 10 J Q K A in each suit; the suit symbols were lost in extraction.] 

Mr. J. H. Quick (a student) did a random cut-and-riffle; then he removed the 
leftmost card and inserted it in a random place, obtaining the sequence 

[Card sequence not recoverable from the source.] 

Which card did he move from the leftmost position? 

b) Starting again with the deck in its original order, Quick now did three 
cut-and-riffles before moving the leftmost card to a new place: 

[Card sequence not recoverable from the source.] 

> 15. [30] (Ole-Johan Dahl.) If X_k = k for 1 ≤ k ≤ t at the start of Algorithm P, and 
if we terminate the algorithm when j reaches the value t − n, the sequence X_{t−n+1}, 
..., X_t is a random permutation of a random combination of n elements. Show how 
to simulate the effect of this procedure using only O(n) cells of memory. 


> 16. [M25] Devise a way to compute a random sample of n records from N, given N 
and n, based on the idea of hashing (Section 6.4). Your method should use O(n) storage 
locations and an average of O(n) units of time, and it should present the sample as a 
sorted set of integers 1 ≤ X_1 < X_2 < ... < X_n ≤ N. 


17. [M22] (R. W. Floyd.) Prove that the following algorithm generates a random 
sample S of n integers from {1, ..., N}: Set S ← ∅; then for j ← N−n+1, N−n+2, 
..., N (in this order), set k ← ⌊jU⌋ + 1 and 

\[ S \leftarrow \begin{cases} S \cup \{k\}, & \text{if } k \notin S; \\ S \cup \{j\}, & \text{if } k \in S. \end{cases} \]
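
The algorithm stated in exercise 17 can be written down directly; the sketch below only transcribes its steps and does not supply the requested proof. The name floyd_sample is hypothetical, a Python set plays the role of S, and a fresh uniform deviate U is assumed for each value of j. 

```python
import random

def floyd_sample(N, n, rng=random):
    """Transcription of the algorithm of exercise 17: n integers from 1..N."""
    S = set()
    for j in range(N - n + 1, N + 1):    # j = N-n+1, N-n+2, ..., N, in order
        k = int(j * rng.random()) + 1    # k = floor(jU) + 1, so 1 <= k <= j
        if k not in S:
            S.add(k)
        else:
            S.add(j)                     # j itself cannot be in S yet
    return S
```

Each pass adds exactly one new element, because every earlier member of S is smaller than the current j; hence the result always has n elements. 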


> 18. [M32] People sometimes try to shuffle n items (X_1, X_2, ..., X_n) by successively 
interchanging 

\[ X_1 \leftrightarrow X_{k_1}, \quad X_2 \leftrightarrow X_{k_2}, \quad \ldots, \quad X_n \leftrightarrow X_{k_n}, \]

where the indices k_j are independent and uniformly random between 1 and n. 
Consider the directed graph with vertices {1, 2, ..., n} and with arcs from j to 
k_j for 1 ≤ j ≤ n. Describe the digraphs of this type for which, if we start with 
the elements (X_1, X_2, ..., X_n) = (1, 2, ..., n), the stated interchanges produce the 
respective permutations (a) (n, 1, 2, ...); (b) (1, 2, ..., n); (c) (2, ..., n, 1). Conclude 
that these three permutations are obtained with wildly different probabilities. 

> 19. [M28] (Priority sampling.) Consider a file of N items in which the kth item 
has a positive weight w_k. Let q_k = U_k/w_k for 1 ≤ k ≤ N, where {U_1, ..., U_N} are 
independent uniform deviates in [0..1). If r is any real number, define 

\[ \hat{w}_k^{(r)} = \begin{cases} \max(w_k, 1/r), & \text{if } q_k < r; \\ 0, & \text{if } q_k \ge r. \end{cases} \]

a) If r is the nth smallest element of {q_1, ..., q_N}, prove that the expected value 
E(ŵ_1^{(r)} ŵ_2^{(r)} ... ŵ_k^{(r)}) is w_1 w_2 ... w_k, for 1 ≤ k < n ≤ N. Hint: Show that, if s is the 
(n−k)th smallest element of {q_{k+1}, ..., q_N}, we have ŵ_1^{(r)} ... ŵ_k^{(r)} = ŵ_1^{(s)} ... ŵ_k^{(s)}. 
(Notice that the quantity s is independent of {U_1, ..., U_k}.) 

b) Consequently E(ŵ_{j_1}^{(r)} ... ŵ_{j_k}^{(r)}) = w_{j_1} ... w_{j_k} when j_1 < ... < j_k. 

c) Show that, if n > 2, the variance Var(ŵ_1^{(r)} + ... + ŵ_N^{(r)}) is Var(ŵ_1^{(r)}) + ... + Var(ŵ_N^{(r)}). 

d) Given n, explain how to modify the reservoir sampling method so that the value 
of r and the n — 1 items with subscripts {j | qj < r} can be obtained with one 
pass through a file of unknown size N. Hint: Use a priority queue of size n. 


By means of the thread one understands the ball of yarn, 
so we'll be satisfied and assured by having this sample. 
— MIGUEL DE CERVANTES, El Ingenioso Hidalgo 
Don Quixote de la Mancha (1605) 


*3.5. WHAT IS A RANDOM SEQUENCE? 


A. Introductory remarks. We have seen in this chapter how to generate 
sequences 

\[ \langle U_n \rangle = U_0, U_1, U_2, \ldots \qquad (1) \]

of real numbers in the range 0 ≤ U_n < 1, and we have called them “random” 
sequences even though they are completely deterministic in character. To justify 
this terminology, we claimed that the numbers “behave as if they are truly 
random.” Such a statement may be satisfactory for practical purposes (at the 
present time), but it sidesteps a very important philosophical and theoretical 
question: Precisely what do we mean by “random behavior”? A quantitative 
definition is needed. It is undesirable to talk about concepts that we do not 
really understand, especially since many apparently paradoxical statements can 
be made about random numbers. 

The mathematical theory of probability and statistics scrupulously avoids 
the issue. It refrains from making absolute statements, and instead expresses 
everything in terms of how much probability is to be attached to statements 
involving random sequences of events. The axioms of probability theory are 
set up so that abstract probabilities can be computed readily, but nothing is 
said about what probability really signifies, or how this concept can be applied 
meaningfully to the actual world. In the book Probability, Statistics, and Truth 
(New York: Macmillan, 1957), R. von Mises discusses this situation in detail, and 
presents the view that a proper definition of probability depends on obtaining a 
proper definition of a random sequence. 

Let us paraphrase here some statements made by two of the many authors 
who have commented on the subject. 


D. H. Lehmer (1951): “A random sequence is a vague notion embodying 
the idea of a sequence in which each term is unpredictable to the uninitiated 
and whose digits pass a certain number of tests, traditional with statisticians 
and depending somewhat on the uses to which the sequence is to be put.” 


J. N. Franklin (1962): “The sequence (1) is random if it has every property 
that is shared by all infinite sequences of independent samples of random 
variables from the uniform distribution.” 


Franklin’s statement essentially generalizes Lehmer’s to say that the se- 
quence must satisfy all statistical tests. His definition is not completely precise, 
and we will see later that a reasonable interpretation of his statement leads us to 
conclude that there is no such thing as a random sequence! So let us begin with 
Lehmer’s less restrictive statement and attempt to make it precise. What we 
really want is a relatively short list of mathematical properties, each of which is 
satisfied by our intuitive notion of a random sequence; furthermore, the list is to 
be complete enough so that we are willing to agree that any sequence satisfying 
these properties is “random.” In this section, we will develop what seems to be 
an adequate definition of randomness according to these criteria, although many 
interesting questions remain to be answered. 

Let u and v be real numbers, 0 ≤ u < v ≤ 1. If U is a random variable 
that is uniformly distributed between 0 and 1, the probability that u ≤ U < v 
is equal to v − u. For example, the probability that 1/5 ≤ U < 3/5 is 2/5. How can 
we translate this property of the single number U into a property of the infinite 
sequence U_0, U_1, U_2, ...? The obvious answer is to count how many times U_n 
lies between u and v, and the average number of times should equal v − u. Our 
intuitive idea of probability is based in this way on the frequency of occurrence. 

More precisely, let ν(n) be the number of values of j, 0 ≤ j < n, such that 
u ≤ U_j < v; we want the ratio ν(n)/n to approach the value v − u as n approaches 
infinity: 

\[ \lim_{n\to\infty} \frac{\nu(n)}{n} = v - u. \qquad (2) \]

If this condition holds for all choices of u and v, the sequence is said to be 
equidistributed. 
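
The ratio ν(n)/n of Eq. (2) is easy to examine experimentally for a finite prefix of a sequence. In the sketch below the function name frequency is hypothetical, and the test sequence is an assumption introduced only for illustration: the fractional parts of multiples of an irrational number (here the golden ratio conjugate), which are known to be equidistributed. A finite computation can of course only suggest, not establish, the limit. 

```python
def frequency(seq, u, v):
    """nu(n)/n: the fraction of terms U_j of seq with u <= U_j < v."""
    hits = sum(1 for x in seq if u <= x < v)
    return hits / len(seq)

# Illustrative sequence (an assumption, not from the text): U_n = n*theta mod 1.
theta = (5 ** 0.5 - 1) / 2
seq = [(n * theta) % 1.0 for n in range(1, 100_001)]

ratio = frequency(seq, 0.2, 0.6)   # should be close to v - u = 0.4
print(ratio)
```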

Let S(n) be a statement about the integer n and the sequence U_0, U_1, ...; 
for example, S(n) might be the statement considered above, “u ≤ U_n < v.” We 
can generalize the idea used in the preceding paragraph to define the probability 
that S(n) is true with respect to a particular infinite sequence. 


Definition A. Let ν(n) be the number of values of j, 0 ≤ j < n, such that S(j) is 
true. We say that S(n) is true with probability λ if the limit as n tends to infinity 
of ν(n)/n equals λ. Symbolically: Pr(S(n)) = λ if lim_{n→∞} ν(n)/n = λ. ▮ 


In terms of this notation, the sequence U_0, U_1, ... is equidistributed if and 
only if Pr(u ≤ U_n < v) = v − u, for all real numbers u, v with 0 ≤ u < v ≤ 1. 

A sequence might be equidistributed without being random. For example, 
if U_0, U_1, ... and V_0, V_1, ... are equidistributed sequences, it is not hard to show 
that the sequence 

\[ W_0, W_1, W_2, W_3, \ldots = \tfrac12 U_0,\; \tfrac12 + \tfrac12 V_0,\; \tfrac12 U_1,\; \tfrac12 + \tfrac12 V_1,\; \ldots \qquad (3) \]

is also equidistributed, since the subsequence (1/2)U_0, (1/2)U_1, ... is equidistributed 
between 0 and 1/2, while the alternate terms 1/2 + (1/2)V_0, 1/2 + (1/2)V_1, ..., are 
equidistributed between 1/2 and 1. But in the sequence of W’s, a value less than 1/2 is 
always followed by a value greater than or equal to 1/2, and conversely; hence the 
sequence is not random by any reasonable definition. A stronger property than 
equidistribution is needed. 
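
A small experiment makes the defect of the interleaved sequence visible. The sketch below builds W from two pseudorandom sequences U and V (an assumption: Python's generator stands in for equidistributed inputs, and the name interleave is hypothetical), then measures one single-term frequency and one frequency for adjacent pairs. 

```python
import random

def interleave(u, v):
    """W_0, W_1, W_2, W_3, ... = U_0/2, 1/2 + V_0/2, U_1/2, 1/2 + V_1/2, ..."""
    w = []
    for a, b in zip(u, v):
        w.append(a / 2)            # lies in [0, 1/2)
        w.append(0.5 + b / 2)      # lies in [1/2, 1)
    return w

rng = random.Random(3)
u = [rng.random() for _ in range(50_000)]
v = [rng.random() for _ in range(50_000)]
w = interleave(u, v)

# Single terms look fine: about half of them fall in [1/4, 3/4).
single = sum(1 for x in w if 0.25 <= x < 0.75) / len(w)
# Adjacent pairs do not: two consecutive values below 1/2 never occur,
# although independent uniform terms would show this about 1/4 of the time.
pairs = sum(1 for a, b in zip(w, w[1:]) if a < 0.5 and b < 0.5) / (len(w) - 1)
print(single, pairs)
```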

A natural generalization of the equidistribution property, which removes 
the objection stated in the preceding paragraph, is to consider adjacent pairs of 
numbers of our sequence. We can require the sequence to satisfy the condition 

\[ \Pr(u_1 \le U_n < v_1 \text{ and } u_2 \le U_{n+1} < v_2) = (v_1 - u_1)(v_2 - u_2) \qquad (4) \]

for any four numbers u_1, v_1, u_2, v_2 with 0 ≤ u_1 < v_1 ≤ 1, 0 ≤ u_2 < v_2 ≤ 1. 
And in general, for any positive integer k we can require our sequence to be 
k-distributed in the following sense: 

Definition B. The sequence (1) is said to be k-distributed if 

\[ \Pr(u_1 \le U_n < v_1, \ldots, u_k \le U_{n+k-1} < v_k) = (v_1 - u_1) \ldots (v_k - u_k) \qquad (5) \]

for all choices of real numbers u_j, v_j, with 0 ≤ u_j < v_j ≤ 1 for 1 ≤ j ≤ k. ▮ 


An equidistributed sequence is a 1-distributed sequence. Notice that if k > 1, 
a k-distributed sequence is always (k − 1)-distributed, since we may set u_k = 0 
and v_k = 1 in Eq. (5). Thus, in particular, any sequence that is known to be 
4-distributed must also be 3-distributed, 2-distributed, and equidistributed. We 
can investigate the largest k for which a given sequence is k-distributed; and this 
leads us to formulate a stronger property: 


Definition C. A sequence is said to be ∞-distributed if it is k-distributed for 
all positive integers k. ▮ 


So far we have considered “[0..1) sequences,” that is, sequences of real 
numbers lying between zero and one. The same ideas apply to integer-valued 
sequences; let us say that the sequence ⟨X_n⟩ = X_0, X_1, X_2, ... is a b-ary sequence 
if each X_n is one of the integers 0, 1, ..., b−1. Thus, a 2-ary (binary) sequence 
is a sequence of zeros and ones. 

We also define a k-digit b-ary number as a string of k integers x_1 x_2 ... x_k, 
where 0 ≤ x_j < b for 1 ≤ j ≤ k. 


Definition D. A b-ary sequence is said to be k-distributed if 

\[ \Pr(X_n X_{n+1} \ldots X_{n+k-1} = x_1 x_2 \ldots x_k) = 1/b^k \qquad (6) \]

for all b-ary numbers x_1 x_2 ... x_k. ▮ 
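
For a finite prefix of a b-ary sequence, the probabilities in Eq. (6) can be estimated by counting overlapping blocks of k consecutive terms. The sketch below is illustrative only: the names to_b_ary and block_frequencies are hypothetical, and a pseudorandom [0..1) sequence converted by X_n = ⌊bU_n⌋ serves as assumed input. Agreement with 1/b^k on a finite prefix is merely suggestive, since the definition concerns a limit. 

```python
import random
from collections import Counter
from itertools import product

def to_b_ary(us, b):
    """Convert values U_n in [0, 1) to b-ary digits X_n = floor(b * U_n)."""
    return [int(b * u) for u in us]

def block_frequencies(xs, b, k):
    """Observed frequency of every k-digit b-ary number among the
    overlapping blocks X_n X_{n+1} ... X_{n+k-1} of the finite list xs."""
    n = len(xs) - k + 1
    counts = Counter(tuple(xs[i:i + k]) for i in range(n))
    return {block: counts[block] / n for block in product(range(b), repeat=k)}

rng = random.Random(4)
bits = to_b_ary((rng.random() for _ in range(200_000)), 2)
freqs = block_frequencies(bits, 2, 3)   # eight blocks, each expected near 1/8
```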


It is clear from this definition that if U_0, U_1, ... is a k-distributed [0..1) 
sequence, then the sequence ⌊bU_0⌋, ⌊bU_1⌋, ... is a k-distributed b-ary sequence. 
(If we set u_j = x_j/b, v_j = (x_j + 1)/b, X_n = ⌊bU_n⌋, Eq. (5) becomes Eq. (6).) 
Furthermore, every k-distributed b-ary sequence is also (k − 1)-distributed, if 
k > 1: We add together the probabilities for the b-ary numbers x_1 ... x_{k−1} 0, 
x_1 ... x_{k−1} 1, ..., x_1 ... x_{k−1} (b − 1) to obtain 

\[ \Pr(X_n \ldots X_{n+k-2} = x_1 \ldots x_{k-1}) = 1/b^{k-1}. \]

(Probabilities for disjoint events are additive; see exercise 5.) It therefore is 
natural to speak of an ∞-distributed b-ary sequence, as in Definition C above. 

The representation of a positive real number in the radix-b number system 
may be regarded as a b-ary sequence; for example, π corresponds to the 10-ary 
sequence 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, .... People have conjectured that 
this sequence is ∞-distributed, but nobody has yet been able to prove that it is 
even 1-distributed. 

Let us analyze these concepts a little more closely in the case when k equals 
a million. A binary sequence that is 1000000-distributed is going to have runs of 
a million zeros in a row! Similarly, a [0..1) sequence that is 1000000-distributed 
is going to have runs of a million consecutive values each of which is less than 1/2. 
It is true that this will happen only (1/2)^1000000 of the time, on the average, but 
the fact is that it does happen. Indeed, this phenomenon will occur in any truly 
random sequence, using our intuitive notion of “truly random.” One can easily 
imagine that such a situation will have a drastic effect if this set of a million 
“truly random” numbers is being used in a computer-simulation experiment; 
there would be good reason to complain about the random number generator. 
However, if we have a sequence of numbers that never has runs of a million 
consecutive U’s less than 1/2, the sequence is not random, and it will not be a 
suitable source of numbers for other conceivable applications that use extremely 
long blocks of U’s as input. In summary, a truly random sequence will exhibit 
local nonrandomness. Local nonrandomness is necessary in some applications, 
but it is disastrous in others. We are forced to conclude that no sequence of 
“random” numbers can be adequate for every application. 

In a similar vein, one may argue that it is impossible to judge whether a 
finite sequence is random or not; any particular sequence is just as likely as any 
other one. These facts are definitely stumbling blocks if we are ever to have a 
useful definition of randomness, but they are not really cause for alarm. It is 
still possible to give a definition for the randomness of infinite sequences of real 
numbers in such a way that the corresponding theory (viewed properly) will give 
us a great deal of insight concerning the ordinary finite sequences of rational 
numbers that are actually generated on a computer. Furthermore, we shall see 
later in this section that there are several plausible definitions of randomness for 
finite sequences. 


B. ∞-distributed sequences. Let us now make a brief study of the theory 
of sequences that are ∞-distributed. To describe the theory adequately, we will 
need to use a bit of higher mathematics, so we assume in the remainder of this 
subsection that the reader knows the material ordinarily taught in an “advanced 
calculus” course. 
First it is convenient to generalize Definition A, since the limit appearing 
there does not exist for all sequences. We define 
$$\overline{\Pr}(S(n)) = \limsup_{n\to\infty} \frac{\nu(n)}{n}, \qquad \underline{\Pr}(S(n)) = \liminf_{n\to\infty} \frac{\nu(n)}{n}. \tag{7}$$
Then $\Pr(S(n))$, if it exists, is the common value of $\underline{\Pr}(S(n))$ and $\overline{\Pr}(S(n))$. 
We have seen that a k-distributed [0..1) sequence leads to a k-distributed 
b-ary sequence, if $U$ is replaced by $\lfloor bU \rfloor$. Our first theorem shows that a converse 
result is also true. 


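Theorem A. Let $\langle U_n\rangle = U_0, U_1, U_2, \ldots$ be a [0..1) sequence. If the sequence 
$$\langle \lfloor b_j U_n \rfloor \rangle = \lfloor b_j U_0 \rfloor, \lfloor b_j U_1 \rfloor, \lfloor b_j U_2 \rfloor, \ldots$$
is a k-distributed $b_j$-ary sequence for all $b_j$ in an infinite sequence of integers 
$1 < b_1 < b_2 < b_3 < \cdots$, then the original sequence $\langle U_n\rangle$ is k-distributed. 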


As an example of this theorem, suppose that $b_j = 2^j$. The sequence 
$\lfloor 2^j U_0 \rfloor, \lfloor 2^j U_1 \rfloor, \ldots$ is essentially the sequence of the first j bits of the binary 
representations of $U_0, U_1, \ldots$. If all these integer sequences are k-distributed, 
in the sense of Definition D, then the real-valued sequence $U_0, U_1, \ldots$ must also 
be k-distributed in the sense of Definition B. 
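The relation between $\lfloor 2^j U \rfloor$ and the leading bits of U can be checked with a few lines of Python. This is only an illustrative sketch and not part of the original text; the helper name `first_bits` is hypothetical, and exact rationals are used so that rounding plays no role.

```python
from fractions import Fraction
from math import floor

def first_bits(u, j):
    """Return floor(2**j * u): the integer formed by the first j bits of u in [0, 1)."""
    return floor(2**j * u)

u = Fraction(5, 8)   # 0.101 in binary
# Show the j-bit prefixes of u for j = 1, 2, 3, 4 as binary strings.
print([format(first_bits(u, j), "0{}b".format(j)) for j in range(1, 5)])
```

For U = 5/8 the prefixes are 1, 10, 101, 1010, which are the leading bits of 0.101000... in binary.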


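Proof of Theorem A. If the sequence $\lfloor bU_0 \rfloor, \lfloor bU_1 \rfloor, \ldots$ is k-distributed, it follows 
by the addition of probabilities that Eq. (5) holds whenever each $u_j$ and $v_j$ is a 
rational number with denominator b. Now let $u_j$, $v_j$ be any real numbers, and 
let $u'_j$, $v'_j$ be rational numbers with denominator b such that 
$$u'_j \le u_j < u'_j + 1/b, \qquad v'_j \le v_j < v'_j + 1/b.$$
Let S(n) be the statement that $u_1 \le U_n < v_1, \ldots, u_k \le U_{n+k-1} < v_k$. We have 
$$\overline{\Pr}(S(n)) \le \Pr\Bigl(u'_1 \le U_n < v'_1 + \frac1b, \ldots, u'_k \le U_{n+k-1} < v'_k + \frac1b\Bigr) = \Bigl(v'_1 - u'_1 + \frac1b\Bigr) \ldots \Bigl(v'_k - u'_k + \frac1b\Bigr);$$
$$\underline{\Pr}(S(n)) \ge \Pr\Bigl(u'_1 + \frac1b \le U_n < v'_1, \ldots, u'_k + \frac1b \le U_{n+k-1} < v'_k\Bigr) = \Bigl(v'_1 - u'_1 - \frac1b\Bigr) \ldots \Bigl(v'_k - u'_k - \frac1b\Bigr).$$

Now $|(v'_j - u'_j \pm 1/b) - (v_j - u_j)| \le 2/b$. Since our inequalities hold for all $b = b_j$, 
and since $b_j \to \infty$ as $j \to \infty$, we have 
$$(v_1 - u_1) \ldots (v_k - u_k) \le \underline{\Pr}(S(n)) \le \overline{\Pr}(S(n)) \le (v_1 - u_1) \ldots (v_k - u_k). \quad ∎$$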


The next theorem is our main tool for proving things about k-distributed 
sequences. 


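Theorem B. Suppose that $\langle U_n\rangle$ is a k-distributed [0..1) sequence, and let 
$f(x_1, x_2, \ldots, x_k)$ be a Riemann-integrable function of k variables; then 
$$\lim_{n\to\infty} \frac1n \sum_{0\le j<n} f(U_j, U_{j+1}, \ldots, U_{j+k-1}) = \int_0^1 \cdots \int_0^1 f(x_1, x_2, \ldots, x_k)\, dx_1 \ldots dx_k. \tag{8}$$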


Proof. The definition of a k-distributed sequence states that this result is true 
in the special case that 


$$f(x_1, \ldots, x_k) = [u_1 \le x_1 < v_1, \ldots, u_k \le x_k < v_k] \tag{9}$$

for some constants $u_1, v_1, \ldots, u_k, v_k$. Therefore Eq. (8) is true whenever $f = 
a_1 f_1 + a_2 f_2 + \cdots + a_m f_m$ and when each $f_j$ is a function of type (9); in other 
words, Eq. (8) holds whenever f is a “step-function” obtained by partitioning the 
unit k-dimensional cube into subcells whose faces are parallel to the coordinate 
axes, and assigning a constant value to f on each subcell. 

Now let f be any Riemann-integrable function. If ε is any positive number, 
we know (by the definition of Riemann-integrability) that there exist step functions 
$\underline{f}$ and $\overline{f}$ such that $\underline{f}(x_1,\ldots,x_k) \le f(x_1,\ldots,x_k) \le \overline{f}(x_1,\ldots,x_k)$, and such 
that the difference of the integrals of $\underline{f}$ and $\overline{f}$ is less than ε. Since Eq. (8) holds 
for $\underline{f}$ and $\overline{f}$, and since 
$$\frac1n \sum_{0\le j<n} \underline{f}(U_j, \ldots, U_{j+k-1}) \le \frac1n \sum_{0\le j<n} f(U_j, \ldots, U_{j+k-1}) \le \frac1n \sum_{0\le j<n} \overline{f}(U_j, \ldots, U_{j+k-1}),$$
we conclude that Eq. (8) is true also for f. ∎ 
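A numerical experiment can make Eq. (8) more concrete. The sketch below is not from the original text: it uses Python's `random` module as a stand-in for a k-distributed sequence (an assumption, since no finite sample can be k-distributed in the limiting sense), takes $f(x_1, x_2) = x_1 x_2$, and compares the average over adjacent pairs with the integral 1/4. The name `tuple_average` is hypothetical.

```python
import random

def tuple_average(seq, f, k):
    """Average of f over the adjacent k-tuples (U_j, ..., U_{j+k-1}) of seq."""
    n = len(seq) - k + 1
    return sum(f(*seq[j:j + k]) for j in range(n)) / n

rng = random.Random(12345)                 # fixed seed so the run is repeatable
U = [rng.random() for _ in range(100_000)]
avg = tuple_average(U, lambda x1, x2: x1 * x2, k=2)
print(avg)   # expected to lie near 1/4, the integral of x1*x2 over the unit square
```

The agreement is only approximate for any finite n; Theorem B is a statement about the limit.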


Theorem B can be applied, for example, to the permutation test of Section 3.3.2. 
Let $(p_1, p_2, \ldots, p_k)$ be any permutation of the numbers {1, 2, ..., k}; 
we want to show that 
$$\Pr(U_{n+p_1-1} < U_{n+p_2-1} < \cdots < U_{n+p_k-1}) = 1/k!. \tag{10}$$
To prove this, assume that the sequence $\langle U_n\rangle$ is k-distributed, and let 
$$f(x_1, \ldots, x_k) = [x_{p_1} < x_{p_2} < \cdots < x_{p_k}].$$
We have 
$$\Pr(U_{n+p_1-1} < U_{n+p_2-1} < \cdots < U_{n+p_k-1}) = \int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_k)\, dx_1 \ldots dx_k$$
$$= \int_0^1 dx_{p_k} \int_0^{x_{p_k}} \cdots \int_0^{x_{p_3}} dx_{p_2} \int_0^{x_{p_2}} dx_{p_1} = \frac{1}{k!}.$$

Corollary P. If a [0..1) sequence is k-distributed, it satisfies the permutation 
test of order k, in the sense of Eq. (10). ∎ 
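Eq. (10) can also be explored empirically. The following sketch, which is not part of the original text, tallies the relative order of adjacent k-tuples and reports the frequency of each of the k! orderings. Python's `random` module again stands in for a k-distributed sequence (an assumption), and `ordering_frequencies` is a hypothetical helper name.

```python
import random
from collections import Counter

def ordering_frequencies(seq, k):
    """Relative frequency of each ordering pattern among adjacent k-tuples of seq.

    A pattern (i_1, ..., i_k) means seq[j+i_1] < seq[j+i_2] < ... < seq[j+i_k].
    """
    n = len(seq) - k + 1
    counts = Counter(
        tuple(sorted(range(k), key=lambda i: seq[j + i])) for j in range(n)
    )
    return {pattern: c / n for pattern, c in counts.items()}

rng = random.Random(2024)                  # fixed seed so the run is repeatable
U = [rng.random() for _ in range(100_000)]
for pattern, freq in sorted(ordering_frequencies(U, 3).items()):
    print(pattern, round(freq, 4))         # each value should be near 1/3! = 0.1667
```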


We can also show that the serial correlation test is satisfied: 


Corollary S. If a [0..1) sequence is (k + 1)-distributed, the serial correlation 
coefficient between $U_n$ and $U_{n+k}$ tends to zero: 
$$\lim_{n\to\infty} \frac{\frac1n \sum U_j U_{j+k} - \bigl(\frac1n \sum U_j\bigr)\bigl(\frac1n \sum U_{j+k}\bigr)}{\sqrt{\Bigl(\frac1n \sum U_j^2 - \bigl(\frac1n \sum U_j\bigr)^2\Bigr)\Bigl(\frac1n \sum U_{j+k}^2 - \bigl(\frac1n \sum U_{j+k}\bigr)^2\Bigr)}} = 0.$$
(All summations here are for 0 ≤ j < n.) 
Proof. By Theorem B, the quantities 
$$\frac1n \sum U_j U_{j+k}, \quad \frac1n \sum U_j^2, \quad \frac1n \sum U_{j+k}^2, \quad \frac1n \sum U_j, \quad \frac1n \sum U_{j+k}$$
tend to the respective limits 1/4, 1/3, 1/3, 1/2, 1/2 as n → ∞. ∎ 

Let us now consider some slightly more general distribution properties of 
sequences. We have defined the notion of k-distribution by considering all of 
the adjacent k-tuples; for example, a sequence is 2-distributed if and only if the 
points 
$$(U_0, U_1),\ (U_1, U_2),\ (U_2, U_3),\ (U_3, U_4),\ (U_4, U_5),\ \ldots$$
are equidistributed in the unit square. It is quite possible, however, that this can 
happen while alternate pairs of points $(U_1, U_2), (U_3, U_4), (U_5, U_6), \ldots$ are not 
equidistributed; if the density of points $(U_{2n-1}, U_{2n})$ is deficient in some area, the 
other points $(U_{2n}, U_{2n+1})$ might compensate. For example, the periodic binary 
sequence 


(Xn) =0,0,0,1, 0,0,0,1, 1,1,0,1, 1,1,0,1, 0,0,0,1, ..., (11) 


with a period of length 16, is seen to be 3-distributed; yet the sequence of 
even-numbered elements $\langle X_{2n}\rangle = 0, 0, 0, 0, 1, 0, 1, 0, \ldots$ has three times as many 
zeros as ones, while the subsequence of odd-numbered elements $\langle X_{2n+1}\rangle = 0, 1, 
0, 1, 1, 1, 1, 1, \ldots$ has three times as many ones as zeros. 
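The claims about sequence (11) are easy to confirm by direct counting. The sketch below is an illustration added here, not part of the original text; `tuple_counts` is a hypothetical helper. Because the sequence is periodic, the limiting frequency of a tuple equals its frequency over one period taken cyclically.

```python
from collections import Counter

PERIOD = [0, 0, 0, 1,  0, 0, 0, 1,  1, 1, 0, 1,  1, 1, 0, 1]   # one period of (11)

def tuple_counts(period, k):
    """Count each k-tuple of adjacent elements over one full period, cyclically."""
    n = len(period)
    return Counter(tuple(period[(j + i) % n] for i in range(k)) for j in range(n))

print(tuple_counts(PERIOD, 3))    # each of the 8 binary triples occurs twice
print(Counter(PERIOD[0::2]))      # even-numbered elements: six 0s, two 1s
print(Counter(PERIOD[1::2]))      # odd-numbered elements: two 0s, six 1s
```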
Suppose the sequence $\langle U_n\rangle$ is ∞-distributed. Example (11) shows that the 
subsequence of alternate terms $\langle U_{2n}\rangle = U_0, U_2, U_4, U_6, \ldots$ is not obviously 
guaranteed to be ∞-distributed or even 1-distributed. But we shall see that 
$\langle U_{2n}\rangle$ is, in fact, ∞-distributed, and much more is true. 


Definition E. A [0..1) sequence $\langle U_n\rangle$ is said to be (m, k)-distributed if 
$$\Pr(u_1 \le U_{mn+j} < v_1,\ u_2 \le U_{mn+j+1} < v_2,\ \ldots,\ u_k \le U_{mn+j+k-1} < v_k) = (v_1 - u_1) \ldots (v_k - u_k)$$
for all choices of real numbers $u_r$, $v_r$ with $0 \le u_r < v_r \le 1$ for 1 ≤ r ≤ k, and 
for all integers j with 0 ≤ j < m. ∎ 


Thus a k-distributed sequence is the special case m = 1 in Definition E; the case 
m = 2 means that the k-tuples starting in even positions must have the same 
density as the k-tuples starting in odd positions, etc. 

The following properties of Definition E are obvious: 


An (m, k)-distributed sequence is (m, κ)-distributed for 1 ≤ κ ≤ k. (12) 
An (m, k)-distributed sequence is (d, k)-distributed for all divisors d of m. (13) 


(See exercise 8.) We can also define the concept of an (m, k)-distributed b-ary 
sequence, as in Definition D; and the proof of Theorem A remains valid for 
(m, k)-distributed sequences. 

The next theorem, which is in many ways rather surprising, shows that the 
property of being ∞-distributed is very strong indeed, much stronger than we 
imagined it to be when we first considered the definition of the concept. 


Theorem C (Ivan Niven and H. S. Zuckerman). An ∞-distributed sequence is 
(m, k)-distributed for all positive integers m and k. 

Proof. It suffices to prove the theorem for b-ary sequences, by using the 
generalization of Theorem A just mentioned. Furthermore, we may assume that m = k, 
because (12) and (13) tell us that the sequence will be (m, k)-distributed if it is 
(mk, mk)-distributed. 

So we will prove that any ∞-distributed b-ary sequence $X_0, X_1, \ldots$ is 
(m, m)-distributed for all positive integers m. Our proof is a simplified version of the 
original one given by Niven and Zuckerman in Pacific J. Math. 1 (1951), 103-109. 

The key idea we shall use is an important technique that applies to many 
situations in mathematics: “If the sum of m quantities and the sum of their 
squares are both consistent with the hypothesis that the m quantities are equal, 
then that hypothesis is true.” In a strong form, this principle may be stated as 
follows: 


Lemma E. Given m sequences of numbers $\langle y_{jn}\rangle = y_{j0}, y_{j1}, \ldots$ for 1 ≤ j ≤ m, 
suppose that 
$$\lim_{n\to\infty} (y_{1n} + y_{2n} + \cdots + y_{mn}) = m\alpha,$$
$$\limsup_{n\to\infty} (y_{1n}^2 + y_{2n}^2 + \cdots + y_{mn}^2) \le m\alpha^2. \tag{14}$$

Then for each j, $\lim_{n\to\infty} y_{jn}$ exists and equals α. 
An incredibly simple proof of this lemma is given in exercise 9. ∎ 
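One standard way to see why Lemma E holds is to expand the sum of squared deviations from α. The derivation below is offered as a sketch along those lines; it is not claimed to be the argument given in exercise 9.

```latex
\sum_{j=1}^{m} (y_{jn}-\alpha)^2
  = \sum_{j=1}^{m} y_{jn}^2 \;-\; 2\alpha \sum_{j=1}^{m} y_{jn} \;+\; m\alpha^2 ,
\qquad\text{hence}\qquad
\limsup_{n\to\infty} \sum_{j=1}^{m} (y_{jn}-\alpha)^2
  \;\le\; m\alpha^2 - 2\alpha \cdot m\alpha + m\alpha^2 \;=\; 0 .
```

Each term $(y_{jn}-\alpha)^2$ is nonnegative, so every one of them must tend to zero, which means $y_{jn} \to \alpha$ for each j.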

Resuming our proof of Theorem C, let $x = x_1 x_2 \ldots x_m$ be a b-ary number, 
and say that x occurs at position p if $X_{p-m+1} X_{p-m+2} \ldots X_p = x$. Let $\nu_j(n)$ be 
the number of occurrences of x at position p when p < n and p mod m = j. Let 
$y_{jn} = \nu_j(n)/n$; we wish to prove that 
$$\lim_{n\to\infty} y_{jn} = \frac{1}{m b^m}. \tag{15}$$
First we know that 
$$\lim_{n\to\infty} (y_{0n} + y_{1n} + \cdots + y_{(m-1)n}) = \frac{1}{b^m}, \tag{16}$$
since the sequence is m-distributed. By Lemma E and Eq. (16), the theorem 
will be proved if we can show that 
$$\limsup_{n\to\infty} (y_{0n}^2 + y_{1n}^2 + \cdots + y_{(m-1)n}^2) \le \frac{1}{m b^{2m}}. \tag{17}$$

This inequality is not obvious yet; some rather delicate maneuvering is 
necessary before we can prove it. Let q be a multiple of m, and consider 
$$C(n) = \sum_{0\le j<m} \binom{\nu_j(n) - \nu_j(n-q)}{2}. \tag{18}$$
This is the number of pairs of occurrences of x in positions $p_1$ and $p_2$ for which 
$n - q \le p_1 < p_2 < n$ and $p_2 - p_1$ is a multiple of m. Consider now the sum 
$$S_N = \sum_{n=1}^{N+q} C(n). \tag{19}$$
Each pair of occurrences of x in positions $p_1$ and $p_2$ with $p_1 < p_2 < p_1 + q$, where 
$p_2 - p_1$ is a multiple of m and $p_1 \le N$, is counted exactly $p_1 + q - p_2$ times in 
the total $S_N$ (namely, when $p_2 < n \le p_1 + q$); and the pairs of such occurrences 
with $N < p_1 < p_2 < N + q$ are counted exactly $N + q - p_2$ times. 

Let $d_t(n)$ be the number of pairs of occurrences of x in positions $p_1$ and $p_2$ 
with $p_1 + t = p_2 < n$. The analysis above shows that 
$$\sum_{0<t<q/m} (q - mt)\, d_{mt}(N+q) \ \ge\ S_N \ \ge \sum_{0<t<q/m} (q - mt)\, d_{mt}(N). \tag{20}$$
Since the original sequence is q-distributed, 
$$\lim_{N\to\infty} \frac1N\, d_{mt}(N) = \frac{1}{b^{2m}} \tag{21}$$
for all t, 0 < t < q/m, and therefore by (20) we have 
$$\lim_{N\to\infty} \frac{S_N}{N} = \sum_{0<t<q/m} \frac{q - mt}{b^{2m}} = \frac{q(q-m)}{2m b^{2m}}. \tag{22}$$


This fact will prove the theorem, after some manipulation. 
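By definition, 
$$2 S_N = \sum_{n=1}^{N+q} \sum_{0\le j<m} \Bigl( \bigl(\nu_j(n) - \nu_j(n-q)\bigr)^2 - \bigl(\nu_j(n) - \nu_j(n-q)\bigr) \Bigr),$$
and we can remove the unsquared terms by applying (16) to get 
$$\lim_{N\to\infty} \frac{T_N}{N} = \frac{q(q-m)}{m b^{2m}} + \frac{q}{b^m}, \tag{23}$$
where 
$$T_N = \sum_{n=1}^{N+q} \sum_{0\le j<m} \bigl(\nu_j(n) - \nu_j(n-q)\bigr)^2.$$
Using the inequality 
$$\frac1r \Bigl( \sum_{j=1}^{r} a_j \Bigr)^2 \le \sum_{j=1}^{r} a_j^2$$
(see exercise 1.2.3-30), we find that 
$$\limsup_{N\to\infty} \sum_{0\le j<m} \frac{1}{N(N+q)} \Bigl( \sum_{n=1}^{N+q} \bigl(\nu_j(n) - \nu_j(n-q)\bigr) \Bigr)^2 \le \frac{q(q-m)}{m b^{2m}} + \frac{q}{b^m}. \tag{24}$$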

We also have 
$$q\,\nu_j(N) \ \le \sum_{N<n\le N+q} \nu_j(n) \ = \sum_{n=1}^{N+q} \bigl(\nu_j(n) - \nu_j(n-q)\bigr) \ \le\ q\,\nu_j(N+q),$$
and putting this into (24) gives 
$$\limsup_{N\to\infty} \sum_{0\le j<m} \Bigl( \frac{\nu_j(N)}{N} \Bigr)^2 \le \frac{q-m}{q m b^{2m}} + \frac{1}{q b^m}. \tag{25}$$

This formula has been established whenever q is a multiple of m; and if we let 
q → ∞ we obtain (17), completing the proof. 

For a possibly simpler proof, see J. W. S. Cassels, Pacific J. Math. 2 (1952), 
555-557. ∎ 


Exercises 29 and 30 illustrate the nontriviality of this theorem, and they 
also demonstrate the fact that a q-distributed sequence will have probabilities 
deviating from the true (m, m)-distribution probabilities by essentially $1/\sqrt{q}$ at 
most. (See (25).) The full hypothesis of ∞-distribution is necessary for the 
proof of the theorem. 

As a result of Theorem C, we can prove that an ∞-distributed sequence 
passes the serial test, the maximum-of-t test, the collision test, the birthday 
spacings test, and the tests on subsequences mentioned in Section 3.3.2. It is not 
hard to show that the gap test, the poker test, and the run test are also satisfied 
(see exercises 12 through 14). The coupon collector’s test is considerably more 
difficult to deal with, but it too is passed (see exercises 15 and 16). 


The existence of ∞-distributed sequences of a rather simple type is 
guaranteed by the next theorem. 


Theorem F (J. N. Franklin). The [0..1) sequence $U_0, U_1, U_2, \ldots$ with 
$$U_n = \theta^n \bmod 1 \tag{26}$$
is ∞-distributed for almost all real numbers θ > 1. That is, the set 
$$\{\theta \mid \theta > 1 \text{ and (26) is not ∞-distributed}\}$$
is of measure zero. 


The proofs of this theorem and some generalizations are given in Math. Comp. 
17 (1963), 28-59. ∎ 


Franklin has shown that θ must be a transcendental number for (26) to 
be ∞-distributed. Early in the 1960s, the powers $\langle \pi^n \bmod 1\rangle$ were laboriously 
computed for n ≤ 10000 using multiple-precision arithmetic; and the most 
significant 35 bits of each of these numbers, stored on a disk file, were used 
successfully as a source of uniform deviates. According to Theorem F, the 
probability that the powers $\langle \pi^n \bmod 1\rangle$ are ∞-distributed is equal to 1; yet there 
are uncountably many real numbers, so the theorem gives us no information 
about whether the sequence for π is really ∞-distributed or not. It is a fairly 
safe bet that nobody in our lifetimes will ever prove that this particular sequence 
is not ∞-distributed; but it might not be. Because of these considerations, one 
may legitimately wonder if there is any explicit sequence that is ∞-distributed: 
Is there an algorithm to compute real numbers $U_n$ for all n ≥ 0, such that 
the sequence $\langle U_n\rangle$ is ∞-distributed? The answer is yes, as shown for example 
by D. E. Knuth in BIT 5 (1965), 246-250. The sequence constructed there 
consists entirely of rational numbers; in fact, each number $U_n$ has a terminating 
representation in the binary number system. Another construction of an explicit 
∞-distributed sequence, somewhat more complicated than the sequence just 
cited, follows from Theorem W below. See also N. M. Korobov, Izv. Akad. Nauk 
SSSR 20 (1956), 649-660. 
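The need for multiple-precision arithmetic when computing $\pi^n \bmod 1$ can be seen in a small experiment. The sketch below is not from the original text and does not reproduce the 1960s computation; it uses Python's `decimal` module with an arbitrarily chosen working precision of 60 digits and a 50-decimal value of π, and the name `powers_mod_1` is hypothetical. Since the integer part of $\pi^n$ gains about half a decimal digit per step, far more precision would be needed for n near 10000.

```python
from decimal import Decimal, getcontext

getcontext().prec = 60   # working precision in significant digits (arbitrary choice)
PI = Decimal("3.14159265358979323846264338327950288419716939937510")

def powers_mod_1(theta, count):
    """Return [theta**n mod 1 for n in range(count)] using Decimal arithmetic."""
    out, power = [], Decimal(1)
    for _ in range(count):
        out.append(power % 1)    # fractional part of the current power
        power *= theta
    return out

for n, u in enumerate(powers_mod_1(PI, 6)):
    print(n, round(u, 10))
```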


C. Does ∞-distributed = random? In view of all the theoretical results 
about ∞-distributed sequences, we can be sure of one thing: The concept of 
an ∞-distributed sequence is an important one in mathematics. There is also a 
good deal of evidence that the following statement might be a valid formulation 
of the intuitive idea of randomness: 


Definition R1. A [0..1) sequence is defined to be “random” if it is an 
∞-distributed sequence. ∎ 


We have seen that sequences meeting this definition will satisfy all the statistical 
tests of Section 3.3.2 and many more. 

Let us attempt to criticize this definition objectively. First of all, is every 
“truly random” sequence ∞-distributed? There are uncountably many sequences 
$U_0, U_1, \ldots$ of real numbers between zero and one. If a truly random number 
generator is sampled to give values $U_0, U_1, \ldots$, any of the possible sequences may 
be considered equally likely, and some of the sequences (indeed, uncountably 
many of them) are not even equidistributed. On the other hand, using any 
reasonable definition of probability on this space of all possible sequences leads 
us to conclude that a random sequence is ∞-distributed with probability one. 
We are therefore led to formalize Franklin’s definition of randomness (as given 
at the beginning of this section) in the following way: 


Definition R2. A [0..1) sequence $\langle U_n\rangle$ is defined to be “random” if, whenever 
P is a property such that $P(\langle V_n\rangle)$ holds with probability one for a sequence $\langle V_n\rangle$ 
of independent samples of random variables from the uniform distribution, then 
$P(\langle U_n\rangle)$ is true. ∎ 


Is it perhaps possible that Definition R1 is equivalent to Definition R2? 
Let us try out some possible objections to Definition R1, and see whether these 
criticisms are valid. 

In the first place, Definition R1 deals only with limiting properties of the 
sequence as n → ∞. There are ∞-distributed sequences in which the first million 
elements are all zero; should such a sequence be considered random? 

This objection is not very substantial. If ε is any positive number, there 
is no reason why the first million elements of a sequence should not all be less 
than ε. With probability one, a truly random sequence contains infinitely many 
runs of a million consecutive elements less than ε, so why can’t this happen at 
the beginning of the sequence? 

On the other hand, consider Definition R2 and let P be the property that 
all elements of the sequence are distinct; P is true with probability one, so any 
sequence with a million zeros is not random by this criterion. 

Now let P be the property that no element of the sequence is equal to 
zero; again, P is true with probability one, so by Definition R2 any sequence 
with a zero element is nonrandom. More generally, however, let $x_0$ be any fixed 
number between zero and one, and let P be the property that no element of 
the sequence is equal to $x_0$; Definition R2 now says that no random sequence 
may contain the element $x_0$! We can now prove that no sequence satisfies the 
condition of Definition R2. (For if $U_0, U_1, \ldots$ is such a sequence, take $x_0 = U_0$.) 

Therefore if R1 is too weak a definition, R2 is certainly too strong. The 
“right” definition must be less strict than R2. We have not really shown that R1 
is too weak, however, so let us continue to attack it some more. As mentioned 
above, an ∞-distributed sequence of rational numbers has been constructed. 
(Indeed, this is not so surprising; see exercise 18.) Almost all real numbers are 
irrational; perhaps we should insist that 
$$\Pr(U_n \text{ is rational}) = 0$$
for a random sequence. 

The definition of equidistribution, Eq. (2), says that $\Pr(u \le U_n < v) = v - u$. 
There is an obvious way to generalize this definition, using measure theory: “If 
$S \subseteq [0\,..\,1)$ is a set of measure μ, then 
$$\Pr(U_n \in S) = \mu, \tag{27}$$
for all random sequences $\langle U_n\rangle$.” In particular, if S is the set of rationals, 
it has measure zero, so no sequence of rational numbers is equidistributed in 
this generalized sense. It is reasonable to expect that Theorem B could be 
extended to Lebesgue integration instead of Riemann integration, if property (27) 
is stipulated. However, once again we find that definition (27) is too strict, 
for no sequence satisfies that property. If $U_0, U_1, \ldots$ is any sequence, the set 
$S = \{U_0, U_1, \ldots\}$ is of measure zero, yet $\Pr(U_n \in S) = 1$. Thus, by the force of 
the same argument we used to exclude rationals from random sequences, we can 
exclude all random sequences. 

So far Definition R1 has proved to be defensible. There are, however, some 
quite valid objections to it. For example, if we have a random sequence in the 
intuitive sense, the infinite subsequence 


$$U_0, U_1, U_4, U_9, \ldots, U_{n^2}, \ldots \tag{28}$$

should also be a random sequence. This is not always true for an ∞-distributed 
sequence. In fact, if we take any ∞-distributed sequence and set $U_{n^2} \leftarrow 0$ for 
all n, the counts $\nu_k(n)$ that appear in the test of k-distributivity are changed by 
an amount of order at most $\sqrt{n}$, so the limits of the ratios $\nu_k(n)/n$ remain unchanged. 
Definition R1 unfortunately fails to satisfy this randomness criterion. 

Perhaps we should strengthen R1 as follows: 

Definition R3. A [0..1) sequence is said to be “random” if each of its infinite 
subsequences is ∞-distributed. ∎ 


Once again, however, the definition turns out to be too strict; any equidistributed 
sequence $\langle U_n\rangle$ has a monotonic subsequence with $U_{s_0} < U_{s_1} < U_{s_2} < \cdots$. 

The secret is to restrict the subsequences so that they could be defined by 
a person who does not look at U,, before deciding whether or not it is to be in 
the subsequence. The following definition now suggests itself: 


Definition R4. A [0..1) sequence $\langle U_n\rangle$ is said to be “random” if, for every 
effective algorithm that specifies an infinite sequence of distinct nonnegative 
integers $s_n$ for n ≥ 0, the subsequence $U_{s_0}, U_{s_1}, U_{s_2}, \ldots$ corresponding to this 
algorithm is ∞-distributed. ∎ 


The algorithms referred to in Definition R4 are effective procedures that 
compute $s_n$, given n. (See the discussion in Section 1.1.) Thus, for example, 
the sequence $\langle \pi^n \bmod 1\rangle$ will not satisfy R4, since it is either not equidistributed 
or there is an effective algorithm that determines an infinite subsequence $s_n$ 
with $(\pi^{s_0} \bmod 1) < (\pi^{s_1} \bmod 1) < (\pi^{s_2} \bmod 1) < \cdots$. Similarly, no explicitly 
defined sequence can satisfy Definition R4; this is appropriate, if we agree 
that no explicitly defined sequence can really be random. The explicit-looking 
sequence $\langle \theta^n \bmod 1\rangle$ actually does, however, satisfy Definition R4, for almost 
all real numbers θ > 1; this is no contradiction, since almost all θ are 
uncomputable by algorithms. J. F. Koksma proved that $\langle \theta^{s_n} \bmod 1\rangle$ is 1-distributed 
for almost all θ > 1, if $\langle s_n\rangle$ is any sequence of distinct positive integers 
[Compositio Math. 2 (1935), 250-258]; H. Niederreiter and R. F. Tichy strengthened 
Koksma’s theorem, replacing “1-distributed” by “∞-distributed” [Mathematika 
32 (1985), 26-32]. Only countably many sequences $\langle s_n\rangle$ are effectively definable, 
so $\langle \theta^n \bmod 1\rangle$ almost always satisfies R4. 

Definition R4 is much stronger than Definition R1; but it is still reasonable 
to claim that Definition R4 is too weak. For example, let $\langle U_n\rangle$ be a truly random 
sequence, and define the subsequence $\langle U_{s_n}\rangle$ by the following rules: $s_0 = 0$; and if 
n > 0, $s_n$ is the smallest integer ≥ n for which $U_{s_n-1}, U_{s_n-2}, \ldots, U_{s_n-n}$ are all 
less than 1/2. Thus we are considering the subsequence of values following the first 
consecutive run of n values less than 1/2. Suppose that “$U_n < 1/2$” corresponds to 
the value “heads” in the flipping of a coin. Gamblers tend to feel that a long run 
of “heads” makes the opposite condition, “tails,” more probable, assuming that 
a true coin is being used; and the subsequence $\langle U_{s_n}\rangle$ just defined corresponds to a 
gambling system for a man who places his nth bet on the coin toss following the 
first run of n consecutive “heads.” The gambler may think that $\Pr(U_{s_n} \ge 1/2)$ is 
more than 1/2, but of course in a truly random sequence $\langle U_{s_n}\rangle$ will be completely 
random. No gambling system will ever be able to beat the odds! Definition R4 
says nothing about subsequences formed according to such a gambling system, 
so apparently we need something more. 

Let us define a “subsequence rule” R as an infinite sequence of functions 
$\langle f_n(x_1, \ldots, x_n)\rangle$ where, for n ≥ 0, $f_n$ is a function of n variables, and the 
value of $f_n(x_1, \ldots, x_n)$ is either 0 or 1. Here $x_1, \ldots, x_n$ are elements of some 
set S. (Thus, in particular, $f_0$ is a constant function, either 0 or 1.) A 
subsequence rule R defines a subsequence of any infinite sequence $\langle X_n\rangle$ of elements 
of S as follows: The nth term $X_n$ is in the subsequence $\langle X_n\rangle R$ if and only if 
$f_n(X_0, X_1, \ldots, X_{n-1}) = 1$. Note that the subsequence $\langle X_n\rangle R$ thus defined is 
not necessarily infinite, and it may in fact contain no elements at all. 

For example, the gambler’s subsequence just described corresponds to the 
following subsequence rule: “$f_0 = 1$; and for n > 0, $f_n(x_1, \ldots, x_n) = 1$ if and 
only if there is some k in the range 0 < k ≤ n such that the k consecutive 
parameters $x_m, x_{m-1}, \ldots, x_{m-k+1}$ are all < 1/2 when m = n but not when 
k ≤ m < n.” 
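The notion of a subsequence rule, and the gambler's rule in particular, can be written out directly. The Python sketch below is an illustration added here rather than part of the original text; the names `apply_rule` and `gambler_rule` are hypothetical, and a rule is represented as a function of the list of earlier terms $X_0, \ldots, X_{n-1}$.

```python
def apply_rule(rule, seq):
    """Return the terms X_n of seq for which rule(X_0, ..., X_{n-1}) is true."""
    return [x for n, x in enumerate(seq) if rule(seq[:n])]

def gambler_rule(prefix):
    """f_n for the gambler's rule; prefix holds the n earlier values."""
    n = len(prefix)
    if n == 0:
        return True                       # f_0 = 1
    heads = [x < 0.5 for x in prefix]     # "heads" means a value below 1/2

    def run_ends_at(m, k):
        # True if parameters x_m, x_{m-1}, ..., x_{m-k+1} (1-based) are all heads
        return all(heads[m - k:m])

    # Select X_n if, for some k, a run of k heads ends just before X_n
    # and no run of k heads ended at any earlier point.
    return any(
        run_ends_at(n, k) and not any(run_ends_at(m, k) for m in range(k, n))
        for k in range(1, n + 1)
    )

U = [0.1, 0.9, 0.2, 0.3, 0.8, 0.4, 0.1, 0.2, 0.7]
print(apply_rule(gambler_rule, U))   # terms U_0, U_1, U_4, U_8
```

In this small example the selected positions 0, 1, 4, 8 follow the first runs of 0, 1, 2, and 3 consecutive values below 1/2, in agreement with the description of $s_n$ given earlier.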

A subsequence rule R is said to be computable if there is an effective 
algorithm that determines the value of $f_n(x_1, \ldots, x_n)$, when n and $x_1, \ldots, x_n$ are 
given as input. We had better restrict ourselves to computable subsequence rules 
when trying to define randomness, lest we obtain an overly restrictive definition 
like R3 above. But effective algorithms cannot deal nicely with arbitrary real 
numbers as inputs; for example, if a real number x is specified by an infinite 
radix-10 expansion, there is no algorithm to determine if x is < 1/3 or not, since 
all digits of the number 0.333... have to be examined. Therefore computable 
subsequence rules do not apply to all [0..1) sequences, and it is convenient to 
base our next definition on b-ary sequences. 


Definition R5. A b-ary sequence is said to be “random” if every infinite 
subsequence defined by a computable subsequence rule is 1-distributed. 

A [0..1) sequence $\langle U_n\rangle$ is said to be “random” if the b-ary sequence $\langle \lfloor bU_n \rfloor\rangle$ 
is “random” for all integers b ≥ 2. ∎ 


Note that Definition R5 says only “1-distributed,” not “∞-distributed.” It 
is interesting to verify that this may be done without loss of generality. For we 
may define an obviously computable subsequence rule $R(a_1 \ldots a_k)$ as follows, 
given any b-ary number $a_1 \ldots a_k$: Let $f_n(x_1, \ldots, x_n) = 1$ if and only if n ≥ k 
and $x_{n-k+1} = a_1, \ldots, x_{n-1} = a_{k-1}, x_n = a_k$. Now if $\langle X_n\rangle$ is a k-distributed 
b-ary sequence, this rule $R(a_1 \ldots a_k)$, which selects the subsequence consisting 
of those terms just following an occurrence of $a_1 \ldots a_k$, defines an infinite 
subsequence; and if this subsequence is 1-distributed, each of the (k + 1)-tuples 
$a_1 \ldots a_k a_{k+1}$ for $0 \le a_{k+1} < b$ occurs with probability $1/b^{k+1}$ in $\langle X_n\rangle$. Thus 
we can prove that a sequence satisfying Definition R5 is k-distributed for all k, 
by induction on k. Similarly, we may consider the “composition” of subsequence 
rules: if $R_1$ defines an infinite subsequence $\langle X_n\rangle R_1$, then we can define $R_1 R_2$ to 
be the subsequence rule for which $\langle X_n\rangle R_1 R_2 = (\langle X_n\rangle R_1) R_2$. In this way we find that all 
subsequences considered in Definition R5 are ∞-distributed. (See exercise 32.) 
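The rule $R(a_1 \ldots a_k)$ and the composition of rules can be expressed in a few lines. The sketch below is illustrative only and is not taken from the original text; `apply_rule`, `pattern_rule`, and `compose` are hypothetical names, a rule is modeled as a function of the earlier terms, and the pattern is assumed to have length at least 1.

```python
def apply_rule(rule, seq):
    """Return the terms X_n of seq for which rule(X_0, ..., X_{n-1}) is true."""
    return [x for n, x in enumerate(seq) if rule(seq[:n])]

def pattern_rule(pattern):
    """Rule R(a_1...a_k): select X_n when the k preceding terms equal the pattern."""
    pattern = list(pattern)
    k = len(pattern)                      # assumed to be at least 1

    def rule(prefix):
        return len(prefix) >= k and list(prefix[-k:]) == pattern
    return rule

def compose(rule1, rule2):
    """Rule R1R2: applying it gives the same result as applying R1, then R2."""
    def rule(prefix):
        # X_n must be selected by R1, and R2 must accept the R1-selected history.
        return rule1(prefix) and rule2(apply_rule(rule1, prefix))
    return rule

X = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1]   # one period of sequence (11)
print(apply_rule(pattern_rule([0, 0]), X))             # terms that follow "0 0"
```

On a single period of sequence (11), the terms following the pattern 0 0 are 0, 1, 0, 1, so both values appear equally often after that pattern.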

The fact that ∞-distribution comes out of Definition R5 as a very special 
case is encouraging, and it is a good indication that we may at last have found the 
definition of randomness we have been seeking. But alas, there still is a problem. 
It is not clear that sequences satisfying Definition R5 must satisfy Definition R4. 
The “computable subsequence rules” we have just specified always enumerate 
subsequences $\langle X_{s_n}\rangle$ for which $s_0 < s_1 < \cdots$, but $\langle s_n\rangle$ does not have to be 
monotone in R4; it must only satisfy the condition $s_n \ne s_m$ for n ≠ m. 
To meet this objection, we may combine Definitions R4 and R5 as follows: 


Definition R6. A b-ary sequence $\langle X_n\rangle$ is said to be “random” if, for every 
effective algorithm that specifies an infinite sequence of distinct nonnegative 
integers $\langle s_n\rangle$ as a function of n and the values of $X_{s_0}, \ldots, X_{s_{n-1}}$, the subsequence 
$\langle X_{s_n}\rangle$ corresponding to this algorithm is “random” in the sense of Definition R5. 

A [0..1) sequence $\langle U_n\rangle$ is said to be “random” if the b-ary sequence $\langle \lfloor bU_n \rfloor\rangle$ 
is “random” for all integers b ≥ 2. ∎ 


The author contends* that this definition surely meets all reasonable philo- 
sophical requirements for randomness, so it provides an answer to the principal 
question posed in this section. 


D. Existence of random sequences. We have seen that Definition R3 is 
too strong, in the sense that no sequence can satisfy that definition; and the 
formulation of Definitions R4, R5, and R6 above was carried out in an attempt 
to recapture the essential characteristics of Definition R3. In order to show that 
Definition R6 is not overly restrictive, it is still necessary for us to prove that 
sequences satisfying all these conditions exist. Intuitively, we feel quite sure that 
there is no problem, because we believe that a truly random sequence exists 
and satisfies R6; but a proof is really necessary to show that the definition is 
consistent. 

An interesting method for constructing sequences satisfying Definition R5 
has been found by A. Wald, starting with a very simple 1-distributed sequence. 


Lemma T. Let the sequence of real numbers $\langle V_n\rangle$ be defined in terms of the 
binary system as follows: 
$$V_0 = 0, \quad V_1 = .1, \quad V_2 = .01, \quad V_3 = .11, \quad V_4 = .001, \quad \ldots$$
$$V_n = .c_r \ldots c_1 1 \quad \text{if } n = 2^r + c_1 2^{r-1} + \cdots + c_r. \tag{29}$$

Let $I_{b_1 \ldots b_r}$ denote the set of all real numbers in [0..1) whose binary 
representation begins with $0.b_1 \ldots b_r$; thus 
$$I_{b_1 \ldots b_r} = \bigl[(0.b_1 \ldots b_r)_2 \,..\, (0.b_1 \ldots b_r)_2 + 2^{-r}\bigr). \tag{30}$$
Then if ν(n) denotes the number of $V_k$ in $I_{b_1 \ldots b_r}$ for 0 ≤ k < n, we have 
$$|\nu(n)/n - 2^{-r}| \le 1/n. \tag{31}$$

Proof. Since ν(n) is the number of k for which $k \bmod 2^r = (b_r \ldots b_1)_2$, we have 
ν(n) = t or t + 1 when $\lfloor n/2^r \rfloor = t$. Hence $|\nu(n) - n/2^r| \le 1$. ∎ 
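The sequence of Lemma T is obtained by reflecting the binary digits of n about the radix point, and the bound (31) can be checked by brute force for small cases. The sketch below is an added illustration, not part of the original text; the function names `V` and `nu` are chosen here to mirror the lemma's notation.

```python
from fractions import Fraction

def V(n):
    """V_n of Eq. (29): the binary digits of n reflected about the radix point."""
    value, weight = Fraction(0), Fraction(1, 2)
    while n > 0:
        if n & 1:
            value += weight    # lowest remaining bit of n becomes the next fraction bit
        weight /= 2
        n >>= 1
    return value

def nu(n, bits):
    """Number of V_k, 0 <= k < n, whose binary expansion begins 0.b_1...b_r."""
    r = len(bits)
    low = Fraction(int("".join(map(str, bits)), 2), 2**r)
    return sum(1 for k in range(n) if low <= V(k) < low + Fraction(1, 2**r))

print([str(V(n)) for n in range(8)])   # 0, 1/2, 1/4, 3/4, 1/8, 5/8, 3/8, 7/8
print(nu(10, (1, 0)))                  # count of V_0..V_9 lying in [1/2, 3/4)
```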


It follows from (31) that the sequence $\langle \lfloor 2^r V_n \rfloor\rangle$ is an equidistributed $2^r$-ary 
sequence; hence by Theorem A, $\langle V_n\rangle$ is an equidistributed [0..1) sequence. 
Indeed, it is pretty clear that $\langle V_n\rangle$ is about as equidistributed as a [0..1) sequence 
can be. (For further discussion of this and related sequences, see J. G. van der 
Corput, Proc. Koninklijke Nederl. Akad. Wetenschappen 38 (1935), 813-821, 
1058-1066; J. H. Halton, Numerische Math. 2 (1960), 84-90, 196; S. Haber, J. 
Research National Bur. Standards B70 (1966), 127-136; R. Béjian and H. Faure, 
Comptes Rendus Acad. Sci. A285 (Paris, 1977), 313-316; H. Faure, J. Number 
Theory 22 (1986), 4-20; S. Tezuka, ACM Trans. Modeling and Comp. Simul. 
3 (1993), 99-107. L. H. Ramshaw has shown that the sequence $\langle \phi n \bmod 1\rangle$ is 
slightly more equally distributed than $\langle V_n\rangle$; see J. Number Theory 13 (1981), 
138-175.) 

* At least, he made such a contention when originally preparing this material in 1966. 

Now let $R_1, R_2, \ldots$ be infinitely many subsequence rules; we seek a sequence 
$\langle U_n\rangle$ for which all the infinite subsequences $\langle U_n\rangle R_j$ are equidistributed. 


Algorithm W (Wald sequence). Given an infinite sequence of subsequence rules 
$R_1, R_2, \ldots$ that define subsequences of [0..1) sequences of rational numbers, this 
procedure defines a [0..1) sequence $\langle U_n\rangle$. The computation involves infinitely 
many auxiliary variables $C[a_1, \ldots, a_r]$, where r ≥ 1 and where $a_j = 0$ or 1 for 
1 ≤ j ≤ r. These variables are initially all zero. 

W1. [Initialize n.] Set n ← 0. 

W2. [Initialize r.] Set r ← 1. 

W3. [Test $R_r$.] If the element $U_n$ is to be in the subsequence defined by $R_r$, 
based on the values of $U_k$ for 0 ≤ k < n, set $a_r \leftarrow 1$; otherwise set $a_r \leftarrow 0$. 

W4. [Is case $[a_1, \ldots, a_r]$ unfinished?] If $C[a_1, \ldots, a_r] < 3 \cdot 4^{r-1}$, go to W6. 

W5. [Increase r.] Set r ← r + 1 and return to W3. 

W6. [Set $U_n$.] Increase the value of $C[a_1, \ldots, a_r]$ by 1 and let k be its new value. 
Set $U_n \leftarrow V_k$, where $V_k$ is defined in Lemma T above. 

W7. [Advance n.] Increase n by 1 and return to W2. ∎ 


Strictly speaking, this is not an algorithm, since it doesn’t terminate; but 
we could of course easily modify the procedure to make it stop when n reaches a 
given value. In order to grasp the idea of the construction, the reader is advised 
to try it out manually, replacing the number $3 \cdot 4^{r-1}$ of step W4 by $2^r$ during 
this exercise. 
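For readers who prefer to experiment by machine, steps W1 through W7 can be transcribed almost literally. The Python sketch below is not part of the original text and rests on two assumptions that are labeled in the code: only finitely many rules are supplied, with every missing rule treated as one that never selects an element; and each rule is a function of the list of earlier values $U_0, \ldots, U_{n-1}$. The names `wald_sequence` and `limit` are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def V(n):
    """V_n of Lemma T: the binary digits of n reflected about the radix point."""
    value, weight = Fraction(0), Fraction(1, 2)
    while n > 0:
        if n & 1:
            value += weight
        weight /= 2
        n >>= 1
    return value

def wald_sequence(rules, count, limit=lambda r: 3 * 4 ** (r - 1)):
    """First `count` terms of the sequence of Algorithm W for the given rules."""
    C = defaultdict(int)               # auxiliary counters C[a_1,...,a_r], all zero
    U = []
    for n in range(count):             # W1 and W7: n = 0, 1, 2, ...
        a, r = [], 1                   # W2
        while True:
            # W3: ASSUMPTION - rules beyond the supplied list select nothing
            selected = r <= len(rules) and rules[r - 1](U)
            a.append(1 if selected else 0)
            key = tuple(a)
            if C[key] < limit(r):      # W4: is this case unfinished?
                break
            r += 1                     # W5
        C[key] += 1                    # W6: k is the new value of the counter
        U.append(V(C[key]))
    return U

# Hand-simulation variant suggested in the text: use 2**r in place of 3 * 4**(r-1).
print(wald_sequence([lambda prefix: True], 6, limit=lambda r: 2 ** r))
```

With a single rule that selects every element and the limit $2^r$, the counter for case [1] fills after two steps, after which the values come from case [1, 0]; this is the kind of behavior the manual exercise is meant to reveal.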

Algorithm W is not meant to be a practical source of random numbers. It 
is intended to serve only a theoretical purpose: 


Theorem W. Let $\langle U_n\rangle$ be the sequence of rational numbers defined by 
Algorithm W, and let k be a positive integer. If the subsequence $\langle U_n\rangle R_k$ is infinite, 
it is 1-distributed. 


Proof. Let Aļaı,...,ar] denote the (possibly empty) subsequence of (U;,) con- 
taining precisely those elements U, that, for all 7 < r, belong to subsequence 
(Un) R; if a; = 1 and do not belong to subsequence (Un) R; if a; = 0. 

It suffices to prove, for all r > 1 and all pairs of binary numbers a1... a; 
and b)...6,, that Pr(U, E Io...) = 27" with respect to the subsequence 
Alai, ..., ar], whenever the latter is infinite. (See Eq. (30).) For if r > k, 
the infinite sequence (U,)R» is the finite union of the disjoint subsequences 
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Ala,,...,@,] for ag = 1 and aj = 0 or 1 for 1 < j < r, j # k; and it follows 
that Pr(U, € Ie...) = 27” with respect to (Un)Rp. (See exercise 33.) This is 
enough to show that the sequence is 1-distributed, by Theorem A. 

Let $B[a_1,\ldots,a_r]$ denote the subsequence of $\langle U_n\rangle$ that consists of the values
for those $n$ in which $C[a_1,\ldots,a_r]$ is increased by one in step W6 of the algo-
rithm. By the algorithm, $B[a_1,\ldots,a_r]$ is a finite sequence with at most $3\cdot 4^{r-1}$
elements. All but a finite number of the members of $A[a_1,\ldots,a_r]$ come from the
subsequences $B[a_1,\ldots,a_r,\ldots,a_t]$, where $a_j = 0$ or 1 for $r < j \le t$.

Now assume that $A[a_1,\ldots,a_r]$ is infinite, and let $A[a_1,\ldots,a_r] = \langle U_{s_n}\rangle$,
where $s_0 < s_1 < s_2 < \cdots$. If $N$ is a large integer, with $4^r \le 4^q < N \le 4^{q+1}$, it
follows that the number of values of $k < N$ for which $U_{s_k}$ is in $I_{b_1 \ldots b_r}$ is (except
for finitely many elements at the beginning of the subsequence)

    $\nu(N) = \nu(N_1) + \cdots + \nu(N_m)$.

Here $m$ is the number of subsequences $B[a_1,\ldots,a_t]$ listed above in which $U_{s_k}$
appears for some $k < N$; $N_j$ is the number of values of $k$ with $U_{s_k}$ in the
corresponding subsequence; and $\nu(N_j)$ is the number of such values that are also
in $I_{b_1 \ldots b_r}$. Therefore by Lemma T,

    $|\nu(N) - 2^{-r}N| = |\nu(N_1) - 2^{-r}N_1 + \cdots + \nu(N_m) - 2^{-r}N_m|$
                     $\le |\nu(N_1) - 2^{-r}N_1| + \cdots + |\nu(N_m) - 2^{-r}N_m|$
                     $\le m \le 1 + 2 + 4 + \cdots + 2^{q-r+1} < 2^{q+1}$.

The inequality on $m$ follows here from the fact that, by our choice of $N$, the
element $U_{s_N}$ is in $B[a_1,\ldots,a_t]$ for some $t \le q + 1$.
We have proved that $|\nu(N)/N - 2^{-r}| \le 2^{q+1}/N < 2/\sqrt{N}$. ▮


To show finally that sequences satisfying Definition R5 exist, we note first
that if $\langle U_n\rangle$ is a [0..1) sequence of rational numbers and if $R$ is a computable sub-
sequence rule for a b-ary sequence, we can make $R$ into a computable subsequence
rule $R'$ for $\langle U_n\rangle$ by letting $f'_n(x_1,\ldots,x_n)$ in $R'$ equal $f_n(\lfloor bx_1\rfloor,\ldots,\lfloor bx_n\rfloor)$ in
$R$. If the [0..1) sequence $\langle U_n\rangle R'$ is equidistributed, so is the b-ary sequence
$\langle\lfloor bU_n\rfloor\rangle R$. Now the set of all computable subsequence rules for b-ary sequences,
for all values of $b$, is countable (since only countably many effective algorithms
are possible), so they may be listed in some sequence $R_1$, $R_2$, ...; therefore
Algorithm W defines a [0..1) sequence that is random in the sense of Defini-
tion R5.

This brings us to a somewhat paradoxical situation. As we mentioned earlier, 
no effective algorithm can define a sequence that satisfies Definition R4, and for 
the same reason there is no effective algorithm that defines a sequence satisfying 
Definition R5. A proof of the existence of such random sequences is necessarily 
nonconstructive; how then can Algorithm W construct such a sequence? 

There is no contradiction here; we have merely stumbled on the fact that the 
set of all effective algorithms cannot be enumerated by an effective algorithm. 
In other words, there is no effective algorithm to select the $j$th computable
subsequence rule $R_j$; this happens because there is no effective algorithm to
determine if a computational method ever terminates. But important large classes
of algorithms can be systematically enumerated; thus, for example, Algorithm W 
shows that it is possible to construct, with an effective algorithm, a sequence that 
satisfies Definition R5 if we restrict consideration to subsequence rules that are 
“primitive recursive.” 

By modifying step W6 of Algorithm W, so that it sets $U_n \leftarrow V_{k+t}$ instead
of $V_k$, where $t$ is any nonnegative integer depending on $a_1$, ..., $a_r$, we can show
that there are uncountably many [0..1) sequences satisfying Definition R5.


The following theorem shows still another way to prove the existence of 
uncountably many random sequences, using a less direct argument based on 
measure theory, even if the strong definition R6 is used: 


Theorem M. Let the real number $x$, $0 \le x < 1$, correspond to the binary
sequence $\langle X_n\rangle$ if the binary representation of $x$ is $(0.X_0X_1\ldots)_2$. Under this
correspondence, almost all $x$ correspond to binary sequences that are random in
the sense of Definition R6. (In other words, the set of all real $x$ that correspond
to a binary sequence that is nonrandom by Definition R6 has measure zero.)


Proof. Let $S$ be an effective algorithm that determines an infinite sequence of
distinct nonnegative integers $\langle s_n\rangle$, where the choice of $s_n$ depends only on $n$ and
$X_{s_k}$ for $0 \le k < n$; and let $R$ be a computable subsequence rule. Then any
binary sequence $\langle X_n\rangle$ leads to a subsequence $\langle X_{s_n}\rangle R$, and Definition R6 says
this subsequence must either be finite or 1-distributed. It suffices to prove that
for fixed $R$ and $S$ the set $N(R,S)$ of all real $x$ corresponding to $\langle X_n\rangle$, such
that $\langle X_{s_n}\rangle R$ is infinite and not 1-distributed, has measure zero. For $x$ has a
nonrandom binary representation if and only if $x$ is in $\bigcup N(R,S)$, taken over
the countably many choices of $R$ and $S$.

Therefore let $R$ and $S$ be fixed. Consider the set $T(a_1a_2\ldots a_r)$ defined for
all binary numbers $a_1a_2\ldots a_r$ as the set of all $x$ corresponding to $\langle X_n\rangle$, such
that $\langle X_{s_n}\rangle R$ has $\ge r$ elements whose first $r$ elements are respectively equal to
$a_1$, $a_2$, ..., $a_r$. Our first result is that

    $T(a_1a_2\ldots a_r)$ has measure $\le 2^{-r}$.    (32)


To prove this, we start by observing that $T(a_1a_2\ldots a_r)$ is a measurable set: Each
element of $T(a_1a_2\ldots a_r)$ is a real number $x = (0.X_0X_1\ldots)_2$ for which there
exists an integer $m$ such that algorithm $S$ determines distinct values $s_0$, $s_1$, ...,
$s_m$, and rule $R$ determines a subsequence of $X_{s_0}$, $X_{s_1}$, ..., $X_{s_m}$ such that $X_{s_m}$
is the $r$th element of this subsequence. The set of all real $y = (0.Y_0Y_1\ldots)_2$ such
that $Y_{s_k} = X_{s_k}$ for $0 \le k \le m$ also belongs to $T(a_1a_2\ldots a_r)$, and this is a mea-
surable set consisting of the finite union of dyadic subintervals $I_{b_1 \ldots b_t}$. Since there
are only countably many such dyadic intervals, we see that $T(a_1a_2\ldots a_r)$ is a
countable union of dyadic intervals, and it is therefore measurable. Furthermore,
this argument can be extended to show that the measure of $T(a_1\ldots a_{r-1}0)$ equals
the measure of $T(a_1\ldots a_{r-1}1)$, since the latter is a union of dyadic intervals
obtained from the former by requiring that $Y_{s_k} = X_{s_k}$ for $0 \le k < m$ and
$Y_{s_m} \ne X_{s_m}$. Now since

    $T(a_1\ldots a_{r-1}0) \cup T(a_1\ldots a_{r-1}1) \subseteq T(a_1\ldots a_{r-1})$,

the measure of $T(a_1a_2\ldots a_r)$ is at most one-half the measure of $T(a_1\ldots a_{r-1})$.
The inequality (32) follows by induction on $r$.

Now that (32) has been established, the remainder of the proof is essentially
to show that the binary representations of almost all real numbers are equidis-
tributed. For $0 < \epsilon < 1$, let $B(r,\epsilon)$ be $\bigcup T(a_1\ldots a_r)$, where the union is taken
over all binary strings $a_1\ldots a_r$ for which the number $\nu(r)$ of ones among $a_1\ldots a_r$
satisfies

    $|\nu(r) - \frac12 r| \ge \epsilon r$.

The number of such binary strings is $C(r,\epsilon) = \sum \binom{r}{k}$ summed over all values of $k$
with $|k - \frac12 r| \ge \epsilon r$. Exercise 1.2.10-21 proves that $C(r,\epsilon) \le 2^{r+1}e^{-\epsilon^2 r}$; hence
by (32),

    $B(r,\epsilon)$ has measure $\le 2^{-r}C(r,\epsilon) \le 2e^{-\epsilon^2 r}$.    (33)

The next step is to define

    $B^*(r,\epsilon) = B(r,\epsilon) \cup B(r+1,\epsilon) \cup B(r+2,\epsilon) \cup \cdots$.

The measure of $B^*(r,\epsilon)$ is at most $\sum_{k\ge r} 2e^{-\epsilon^2 k}$, and this is the remainder of a
convergent series, so

    $\lim_{r\to\infty}$ (measure of $B^*(r,\epsilon)$) $= 0$.    (34)


Now if $x$ is a real number whose binary expansion $(0.X_0X_1\ldots)_2$ leads to an
infinite sequence $\langle X_{s_n}\rangle R$ that is not 1-distributed, and if $\nu(r)$ denotes the number
of ones in the first $r$ elements of the latter sequence, then

    $|\nu(r)/r - \frac12| \ge \epsilon$,

for some $\epsilon > 0$ and infinitely many $r$. This means $x$ is in $B^*(r,\epsilon)$ for all $r$. So
finally we find that

    $N(R,S) = \bigcup_{t\ge 2} \bigcap_{r\ge 1} B^*(r, 1/t)$

and, by (34), $\bigcap_{r\ge 1} B^*(r,1/t)$ has measure zero for all $t$. Hence $N(R,S)$ has
measure zero. ▮


From the existence of binary sequences satisfying Definition R6, we can show 
the existence of [0..1) sequences that are random in this sense. For details, see 
exercise 36. The consistency of Definition R6 is thereby established. 


E. Random finite sequences. An argument was given above to indicate that 
it is impossible to define the concept of randomness for finite sequences: Any 
given finite sequence is as likely as any other. Still, nearly everyone would agree 
that the sequence 011101001 is “more random” than 101010101, and even the 
latter sequence is “more random” than 000000000. Although it is true that truly
random sequences will exhibit locally nonrandom behavior, we would expect such
behavior only in a long finite sequence, not in a short one. 

Several ways to define the randomness of a finite sequence have been pro- 
posed, and only a few of the ideas will be sketched here. For simplicity, we shall 
restrict our consideration to the case of b-ary sequences. 


Given a b-ary sequence $X_0$, $X_1$, ..., $X_{N-1}$, we can say that

    $\Pr(S(n)) \approx p$,  if  $|\nu(N)/N - p| \le 1/\sqrt{N}$,    (35)

where $\nu(n)$ is the quantity appearing in Definition A at the beginning of this
section. The sequence above can be called “k-distributed” if

    $\Pr(X_nX_{n+1}\ldots X_{n+k-1} = x_1x_2\ldots x_k) \approx 1/b^k$    (36)

for all b-ary numbers $x_1x_2\ldots x_k$. (Compare with Definition D. Unfortunately
a sequence might turn out to be k-distributed by this new definition when it is
not (k - 1)-distributed.)

A definition of randomness may now be given analogous to Definition R1, 
as follows: 


Definition Q1. A b-ary sequence of length $N$ is “random” if it is k-distributed
(in the sense above) for all positive integers $k \le \log_b N$. ▮


According to this definition, for example, there are 178 nonrandom binary 
sequences of length 11: 


00000001111 10000000111 11000000011 11100000001 11110000000 
00000001110 10000000110 11000000010 11100000000 11010000000 
00000001101 10000000101 11000000001 10100000001 10110000000 
00000001011 10000000011 01000000011 01100000001 01110000000 
00000000111 


plus 01010101010 and all sequences with nine or more zeros, plus all sequences 
obtained from the preceding sequences by interchanging ones and zeros. 
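
The count of 178 can be reproduced by brute force. The following Python sketch is
only an illustration, and it rests on one assumption that the text does not spell
out: $\nu$ counts the windows $X_n \ldots X_{n+k-1}$ that lie entirely inside the
sequence (no wraparound), while the denominator $N$ in (35) stays the full length.
The function names are hypothetical.

```python
from itertools import product

def is_k_distributed(seq, k, b=2):
    """Check (36) for every k-digit pattern, using criterion (35).

    Assumption: only windows that fit inside the sequence are counted,
    and N is the full length of the sequence.
    """
    N = len(seq)
    counts = {}
    for n in range(N - k + 1):
        w = tuple(seq[n:n + k])
        counts[w] = counts.get(w, 0) + 1
    for pattern in product(range(b), repeat=k):
        nu = counts.get(pattern, 0)
        # |nu/N - b^-k| <= 1/sqrt(N), rewritten in exact integer arithmetic.
        if (nu * b**k - N) ** 2 > N * b ** (2 * k):
            return False
    return True

def is_random_q1(seq, b=2):
    """Definition Q1: k-distributed for all positive k with b^k <= N."""
    k = 1
    while b**k <= len(seq):
        if not is_k_distributed(seq, k, b):
            return False
        k += 1
    return True

nonrandom = [s for s in product((0, 1), repeat=11) if not is_random_q1(s)]
print(len(nonrandom))  # 178 under the reading described above
```

Under this reading the failures come from three sources: too few or too many ones
(k = 1), a run of eight equal digits (k = 2), and five or more occurrences of a
single triple such as 000 or 010 (k = 3).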

Similarly, we can formulate a definition for finite sequences analogous to
Definition R6. Let A be a set of algorithms, each of which is a selection-and-
choice procedure that gives a subsequence $\langle X_{s_n}\rangle R$ as in the proof of Theorem M.


Definition Q2. The b-ary sequence $X_0$, $X_1$, ..., $X_{N-1}$ is $(n,\epsilon)$-random with
respect to a set of algorithms A, if for every subsequence $X_{t_1}$, $X_{t_2}$, ..., $X_{t_m}$
determined by an algorithm of A we have either $m < n$ or

    $\left| \frac{1}{m}\,\nu_a(X_{t_1},\ldots,X_{t_m}) - \frac{1}{b} \right| \le \epsilon$    for $0 \le a < b$.

Here $\nu_a(x_1,\ldots,x_m)$ is the number of $a$'s in the sequence $x_1$, ..., $x_m$. ▮


(In other words, every sufficiently long subsequence determined by an algo- 
rithm of A must be approximately equidistributed.) The basic idea in this case 
is to let A be a set of “simple” algorithms; the number (and the complexity) of 
the algorithms in A can grow as N grows.

As an example of Definition Q2, let us consider binary sequences, and let A
be just the following four algorithms:


a) Take the whole sequence.
b) Take alternate terms of the sequence, starting with the first.
c) Take the terms of the sequence following a zero.
d) Take the terms of the sequence following a one.

Now a sequence $X_0$, $X_1$, ..., $X_7$ is $(4, \frac18)$-random with respect to A if:

by (a), $|\frac18(X_0 + X_1 + \cdots + X_7) - \frac12| \le \frac18$, that is, if there are 3, 4, or 5 ones;
by (b), $|\frac14(X_0 + X_2 + X_4 + X_6) - \frac12| \le \frac18$, that is, if there are exactly 2 ones in
        even-numbered positions;
by (c), there are three possibilities depending on how many zeros occupy posi-
        tions $X_0$, ..., $X_6$: If there are 2 or 3 zeros here, there is no condition
        to test (since n = 4); if there are 4 zeros, they must respectively be
        followed by two zeros and two ones; and if there are 5 zeros, they must
        respectively be followed by two or three zeros;
by (d), we get conditions similar to those implied by (c).


It turns out that only the following binary sequences of length 8 are $(4, \frac18)$-
random with respect to these rules:


00001011 00101001 01001110 01101000 
00011010 00101100 01011011 01101100 
00011011 00110010 01011110 01101101 
00100011 00110011 01100010 01110010 
00100110 00110110 01100011 01110110 
00100111 00111001 01100110 


plus those obtained by interchanging 0 and 1 consistently. 
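
The table can be checked mechanically. The sketch below is an illustration with
hypothetical function names, not part of the original text; it applies Definition Q2
with n = 4 and tolerance 1/8 to all 256 binary sequences of length 8, using exact
rational arithmetic so that the boundary case "difference equals 1/8" is handled
correctly.

```python
from fractions import Fraction
from itertools import product

def rules(x):
    """Yield the subsequences selected by algorithms (a)-(d)."""
    yield list(x)                                               # (a) whole sequence
    yield list(x[0::2])                                         # (b) alternate terms
    yield [x[i + 1] for i in range(len(x) - 1) if x[i] == 0]    # (c) after a zero
    yield [x[i + 1] for i in range(len(x) - 1) if x[i] == 1]    # (d) after a one

def is_random_q2(x, n, eps, b=2):
    """Definition Q2 for the four rules above."""
    for sub in rules(x):
        m = len(sub)
        if m < n:
            continue  # short subsequences impose no condition
        for a in range(b):
            if abs(Fraction(sub.count(a), m) - Fraction(1, b)) > eps:
                return False
    return True

good = ["".join(map(str, x)) for x in product((0, 1), repeat=8)
        if is_random_q2(x, n=4, eps=Fraction(1, 8))]
print(len(good))  # 46: the 23 sequences listed above and their complements
```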

It is clear that we could make the set of algorithms so large that no sequences
satisfy the definition, when $n$ and $\epsilon$ are reasonably small. A. N. Kolmogorov has
proved that an $(n,\epsilon)$-random binary sequence will always exist, for any given $N$,
if the number of algorithms in A does not exceed

    $\frac12 e^{2n\epsilon^2(1-\epsilon)}$.    (37)


This result is not nearly strong enough to show that sequences satisfying Defi- 
nition Q1 will exist, but the latter can be constructed efficiently using the 
procedure of Rees in exercise 3.2.2-21. A generalized spectral test, based on 
discrete Fourier transforms, can be used to test how well a sequence measures 
up to Definition Q1 [see A. Compagner, Physical Rev. E52 (1995), 5634-5645]. 

Still another interesting approach to a definition of randomness has been
taken by Per Martin-Löf [Information and Control 9 (1966), 602-619]. Given
a finite b-ary sequence $X_1$, ..., $X_N$, let $l(X_1,\ldots,X_N)$ be the length of the
shortest Turing machine program that generates this sequence. (Alternatively,
we could use other classes of effective algorithms, such as those discussed in
Section 1.1.) Then $l(X_1,\ldots,X_N)$ is a measure of the “patternlessness” of
the sequence, and we may equate this idea with randomness. The sequences
of length $N$ that maximize $l(X_1,\ldots,X_N)$ may be called random. (From the
standpoint of practical random number generation by computer, this is, of course,
the worst definition of “randomness” that can be imagined!)

Essentially the same definition of randomness was given independently by 
G. Chaitin at about the same time; see JACM 16 (1969), 145-159. It is interest- 
ing to note that even though this definition makes no reference to equidistribution 
properties as our other definitions have, Martin-Löf and Chaitin have proved that
random sequences of this type also have the expected equidistribution properties.
In fact, Martin-Löf has demonstrated that such sequences satisfy all computable
statistical tests for randomness, in an appropriate sense. 

For further developments in the definition of random finite sequences, see 
A. K. Zvonkin and L. A. Levin, Uspekhi Mat. Nauk 25,6 (November 1970), 
85-127 [English translation in Russian Math. Surveys 25,6 (November 1970), 
83-124]; L. A. Levin, Doklady Akad. Nauk SSSR 212 (1973), 548-550; L. A. 
Levin, Information and Control 61 (1984), 15-37. 


F. Pseudorandom numbers. It is comforting from a theoretical standpoint 
to know that random finite sequences of various flavors exist, but such theorems 
don’t answer the questions faced by real-world programmers. More recent devel- 
opments have led to a more relevant theory, based on the study of sets of finite 
sequences. More precisely, we consider multisets in which sequences may appear 
more than once. 

Let $S$ be a multiset containing bit strings (binary sequences) of length $N$;
we call $S$ an N-source. Let $\$_N$ denote the special N-source that contains all $2^N$
possible N-bit strings. Each element of $S$ represents a sequence that we might
use as a source of pseudorandom bits; choosing different “seed” values leads to
different elements of $S$. For example, $S$ might be

    $\{B_1B_2\ldots B_N \mid B_j$ is the most significant bit of $X_j\}$    (38)

in the linear congruential sequence defined by $X_{j+1} = (aX_j + c) \bmod 2^e$, where
there is one string $B_1B_2\ldots B_N$ for each of the $2^e$ starting values $X_0$.

The basic idea of pseudorandom sequences, as we have seen throughout this
chapter, is to get $N$ bits that appear to be random, although we rely only on
a few “truly random” bits when we choose the seed value. In the example just
considered, we need $e$ truly random bits to select $X_0$; in general, selecting a
member of $S$ amounts to using $\lg|S|$ truly random bits, after which we proceed
deterministically. If $N = 10^6$ and $|S| = 2^{32}$, we are getting more than 30,000
“apparently random” bits for each truly random bit expended. With $\$_N$ instead
of $S$, we get no such amplification, because $\lg|\$_N| = N$.
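
As a concrete illustration of (38), the sketch below builds the N-source of a toy
linear congruential generator and evaluates the amplification ratio quoted above.
The parameters a = 5, c = 1, e = 3, N = 6 and the function name are assumptions
chosen only to keep the example small; they do not come from the text.

```python
from math import log2

def lcg_source(a, c, e, N):
    """Return the N-source (38) as a list with one N-bit tuple per seed X_0."""
    m = 1 << e
    source = []
    for x0 in range(m):
        x, bits = x0, []
        for _ in range(N):
            x = (a * x + c) % m          # X_{j+1} = (a X_j + c) mod 2^e
            bits.append(x >> (e - 1))    # B_j = most significant bit of X_j
        source.append(tuple(bits))
    return source

S = lcg_source(a=5, c=1, e=3, N=6)       # toy parameters, for illustration only
print(len(S), S[0])                      # 8 (0, 1, 1, 1, 1, 0)

# Amplification for N = 10^6 and |S| = 2^32: N / lg|S| bits per truly random bit.
print(10**6 / log2(2**32))               # 31250.0
```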

What does it mean to be “apparently random”? A. C. Yao proposed a
good definition in 1982: Consider any algorithm $A$ that looks at a bit string
$B = B_1\ldots B_N$ and outputs the value $A(B) = 0$ or 1. We may think of $A$ as a
test for randomness; for example, $A$ might compute the distribution of runs of
consecutive 0s and 1s, outputting 1 if the run lengths differ significantly from
the expected distribution. Whatever $A$ does, we can consider the probability
$P(A,S)$ that $A(B) = 1$ when $B$ is a randomly chosen element of $S$, and we
can compare it to the probability $P(A,\$_N)$ that $A(B) = 1$ when $B$ is a truly
random bit string of length $N$. If $P(A,S)$ is extremely close to $P(A,\$_N)$ for all
statistical tests $A$, we cannot tell the difference between the sequences of $S$ and
truly random binary sequences.


Definition P. We say that an N-source $S$ passes statistical test $A$ with toler-
ance $\epsilon$ if $|P(A,S) - P(A,\$_N)| < \epsilon$. It fails the test if $|P(A,S) - P(A,\$_N)| \ge \epsilon$. ▮

The algorithm A need not be designed by statisticians. Any algorithm can be 
considered a statistical test for randomness, according to Definition P. We allow 
A to flip coins (that is, to use truly random bits) as it performs its calculations. 
The only requirement is that A must output 0 or 1. 
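
For very small N, Definition P can be evaluated exactly by enumeration. The sketch
below is an illustration under stated assumptions: it reuses a toy linear
congruential source (a = 5, c = 1, e = 3, N = 6, none of which come from the text),
it handles only deterministic tests that flip no coins, and the two tests
`first_bit` and `half_period` are hypothetical examples.

```python
from fractions import Fraction
from itertools import product

def lcg_source(a, c, e, N):
    """One N-bit tuple of most significant bits per seed, as in (38)."""
    m = 1 << e
    out = []
    for x0 in range(m):
        x, bits = x0, []
        for _ in range(N):
            x = (a * x + c) % m
            bits.append(x >> (e - 1))
        out.append(tuple(bits))
    return out

def P(A, source):
    """Exact probability that A outputs 1 on a uniformly chosen element."""
    source = list(source)
    return Fraction(sum(A(B) for B in source), len(source))

def passes(A, S, N, eps):
    """Definition P, with $_N enumerated as all 2^N bit strings."""
    return abs(P(A, S) - P(A, product((0, 1), repeat=N))) < eps

S = lcg_source(5, 1, 3, 6)
first_bit = lambda B: B[0]                   # outputs B_1
half_period = lambda B: int(B[0] != B[4])    # outputs 1 when B_1 differs from B_5

print(passes(first_bit, S, 6, Fraction(1, 10)))     # True
print(passes(half_period, S, 6, Fraction(1, 10)))   # False
```

The second test succeeds in exposing this toy source because a full-period
generator modulo $2^e$ satisfies $X_{j+2^{e-1}} \equiv X_j + 2^{e-1}$, so the leading
bit always flips after half a period.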

Well, actually there is another requirement: We insist that A must deliver 
its output in a reasonable time, at least on the average. We’re not interested in 
algorithms that will take many years to run, because we will never notice any 
disparities between $S$ and $\$_N$ if our computers cannot detect them during our
lifetime. The sequences of $S$ contain only $\lg|S|$ bits of information, so there
surely are algorithms that will eventually detect the redundancy; but we don’t
care, as long as $S$ is able to pass all the tests that really matter.

These qualitative ideas can be quantified, as we will now see. The theory 
is rather subtle, but it is sufficiently beautiful and important that readers who 
take the time to study the details carefully will be amply rewarded. 

In the following discussion, the running time $T(A)$ of an algorithm $A$ on
N-bit strings is defined to be the maximum of the expected number of steps
needed to output $A(B)$, maximized over all $B \in \$_N$; the expected number is
averaged over all coin flips made by the algorithm.

The first step in our quantitative analysis is to show that we may restrict
the tests to be of a very special kind. Let $A_k$ be an algorithm that depends only
on the first $k$ bits of the input string $B = B_1\ldots B_N$, where $0 \le k < N$, and let
$A_k^P(B) = (A_k(B) + B_{k+1} + 1) \bmod 2$. Thus $A_k^P$ outputs 1 if and only if $A_k$ has
successfully predicted $B_{k+1}$; we call $A_k^P$ a prediction test.


Lemma P1. Let $S$ be an N-source. If $S$ fails test $A$ with tolerance $\epsilon$, there is an
integer $k \in \{0,1,\ldots,N-1\}$ and a prediction test $A_k^P$ with $T(A_k^P) \le T(A) + O(N)$
such that $S$ fails $A_k^P$ with tolerance $\epsilon/N$.


Proof. By complementing the output of $A$, if necessary, we may assume that
$P(A,S) - P(A,\$_N) \ge \epsilon$. Consider the algorithms $F_k$ that begin by flipping $N - k$
coins and replacing $B_{k+1}\ldots B_N$ by random bits $B'_{k+1}\ldots B'_N$ before executing $A$.
Algorithm $F_N$ is the same as $A$, while $F_0$ acts on $S$ as if $A$ were acting on $\$_N$. Let
$p_k = P(F_k,S)$. Since $\sum_{k=0}^{N-1}(p_{k+1} - p_k) = p_N - p_0 = P(A,S) - P(A,\$_N) \ge \epsilon$,
there is some $k$ such that $p_{k+1} - p_k \ge \epsilon/N$.

Let $A_k^P$ be the algorithm that performs the computations of $F_k$ and predicts
the value $(F_k(B) + B'_{k+1} + 1) \bmod 2$; in other words, it outputs

    $A_k^P(B) = (F_k(B) + B_{k+1} + B'_{k+1}) \bmod 2$.    (39)

A careful analysis of probabilities shows that $P(A_k^P,S) - P(A_k^P,\$_N) = p_{k+1} - p_k$.
(See exercise 40.) ▮
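
The hybrid probabilities $p_k$ of this proof can be computed exactly for a tiny
source. The sketch below is an illustration under stated assumptions: the toy
linear congruential source (a = 5, c = 1, e = 3, N = 6) and the test
`half_period` are hypothetical choices, and the coin flips of $F_k$ are averaged
by enumerating every possible tail.

```python
from fractions import Fraction
from itertools import product

def lcg_source(a, c, e, N):
    """One N-bit tuple of most significant bits per seed, as in (38)."""
    m = 1 << e
    out = []
    for x0 in range(m):
        x, bits = x0, []
        for _ in range(N):
            x = (a * x + c) % m
            bits.append(x >> (e - 1))
        out.append(tuple(bits))
    return out

def hybrid_probabilities(A, S, N):
    """Return [p_0, ..., p_N], where p_k = P(F_k, S): the first k bits of B are
    kept and the remaining N - k bits are replaced by every possible tail."""
    p = []
    for k in range(N + 1):
        total = 0
        for B in S:
            for tail in product((0, 1), repeat=N - k):
                total += A(B[:k] + tail)
        p.append(Fraction(total, len(S) * 2 ** (N - k)))
    return p

N = 6
S = lcg_source(5, 1, 3, N)
half_period = lambda B: int(B[0] != B[4])   # hypothetical test: B_1 differs from B_5
p = hybrid_probabilities(half_period, S, N)
print([str(q) for q in p])   # ['1/2', '1/2', '1/2', '1/2', '1/2', '1', '1']
```

Here the whole gap $p_N - p_0 = 1/2$ is concentrated in the single step from
$p_4$ to $p_5$, which identifies $B_5$ as the bit that the first four bits predict.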


Most N-sources $S$ of practical interest are shift-symmetric in the sense that
every substring $B_1\ldots B_k$, $B_2\ldots B_{k+1}$, ..., $B_{N-k+1}\ldots B_N$ of length $k$ has the
same probability distribution. This holds, for example, when $S$ corresponds
to a linear congruential sequence as in (38). In such cases we can improve on
Lemma P1 by taking $k = N - 1$:


Lemma P2. If $S$ is a shift-symmetric N-source that fails test $A$ with tolerance $\epsilon$,
there is an algorithm $A'$ with $T(A') \le T(A) + O(N)$ that predicts $B_N$ from
$B_1\ldots B_{N-1}$ with probability at least $\frac12 + \epsilon/N$.


Proof. If $P(A,S) > P(A,\$_N)$, let $A'$ be the $A_k^P$ in the proof of Lemma P1,
but applied to $B_{N-k}\ldots B_{N-1}0\ldots 0$ instead of $B_1\ldots B_N$. Then $A'$ has the same
average behavior, because of shift-symmetry. If $P(A,S) < P(A,\$_N)$, let $A'$ be
$1 - A_k^P$ in the same fashion. Clearly $P(A',\$_N) = \frac12$. ▮


Now let’s specialize $S$ even more, by supposing that each of the sequences
$B_1B_2\ldots B_N$ has the form $f(g(X_0))\,f(g(g(X_0)))\ldots f(g^{[N]}(X_0))$ as $X_0$ ranges
over some set $X$, where $g$ is a permutation of $X$ and $f(x)$ is 0 or 1 for all
$x \in X$. Our linear congruential example satisfies this restriction, with $X =
\{0,1,\ldots,2^e-1\}$, $g(x) = (ax + c) \bmod 2^e$, and $f(x)$ = most significant bit of $x$.
Such N-sources will be called iterative.


Lemma P3. If $S$ is an iterative N-source that fails test $A$ with tolerance $\epsilon$, there
is an algorithm $A'$ with $T(A') \le T(A) + O(N)$ that predicts $B_1$ from $B_2\ldots B_N$
with probability at least $\frac12 + \epsilon/N$.


Proof. An iterative N-source is shift-symmetric, and so is its reflection $S^R =
\{B_N\ldots B_1 \mid B_1\ldots B_N \in S\}$. Therefore Lemma P2 applies to $S^R$. ▮


The permutation $g(x) = (ax + c) \bmod 2^e$ is easy to invert, in the sense that
we can determine $x$ from $g(x)$ whenever $a$ is odd. But many easily computed
permutation functions are “one-way” (hard to invert), and we will see that
this makes them provably good sources of pseudorandom numbers.


Lemma P4. Let $S$ be an iterative N-source corresponding to $f$, $g$, and $X$. If $S$
fails test $A$ with tolerance $\epsilon$, there is an algorithm $G$ that correctly guesses $f(x)$,
given $g(x)$, with probability $\ge \frac12 + \epsilon/N$, when $x$ is a random element of $X$. The
running time $T(G)$ is at most $T(A) + O(N)(T(f) + T(g))$.


Proof. Given $y = g(x)$, the desired algorithm $G$ computes $B_2 = f(g(x))$, $B_3 =
f(g(g(x)))$, ..., $B_N = f(g^{[N-1]}(x))$ and applies the algorithm $A'$ of Lemma P3.
It guesses $f(x) = B_1$ with probability $\ge \frac12 + \epsilon/N$, because $g$ is a permutation
of $X$, and $B_1\ldots B_N$ is the element of $S$ corresponding to the seed value $X_0$ for
which we have $g(X_0) = x$. ▮
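
The construction in this proof can be tried on a toy linear congruential source.
The sketch below is an illustration under stated assumptions: the parameters
a = 5, c = 1, e = 3 are not from the text, and the predictor used in place of
$A'$ (guess that $B_1$ is the complement of $B_5$) is a hypothetical rule that
happens to be exact for this particular generator.

```python
A_MULT, C_INC, E_BITS = 5, 1, 3      # toy parameters, assumed for illustration

def g(x):
    """The permutation x -> (a x + c) mod 2^e."""
    return (A_MULT * x + C_INC) % (1 << E_BITS)

def f(x):
    """Most significant bit of x."""
    return x >> (E_BITS - 1)

def guess_f(y):
    """Guess f(x) given only y = g(x).

    From y we can compute B_2 = f(y), B_3 = f(g(y)), ..., so in particular
    B_5 = f(g(g(g(y)))). The predictor then outputs 1 - B_5 as its guess
    for B_1 = f(x).
    """
    for _ in range(3):
        y = g(y)
    return 1 - f(y)

hits = sum(guess_f(g(x)) == f(x) for x in range(1 << E_BITS))
print(hits, "correct guesses out of", 1 << E_BITS)   # 8 correct guesses out of 8
```

The guess never fails here because g is easy to analyze: four applications of g
add $2^{e-1}$ to x. A one-way permutation would offer no such shortcut.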


In order to use Lemma P4, we need to amplify the ability to guess a single
bit $f(x)$ to an ability to guess $x$ itself, given only the value of $g(x)$. There is
a nice general way to do this, using the properties of Boolean functions, if we
extend $S$ so that many different functions $f(x)$ need to be guessed. (However,
the method is somewhat technical, so the first-time reader may want to skip
down to Theorem G before looking closely at the details that follow.)

Suppose $G(z_1\ldots z_R)$ is a binary-valued function on R-bit strings that is
good at guessing a function of the form $f(z_1\ldots z_R) = (x_1z_1 + \cdots + x_Rz_R) \bmod 2$
for some fixed $x = x_1\ldots x_R$. It is convenient to measure the success of $G$ by
computing the expected value

    $s = E\bigl((-1)^{G(z_1\ldots z_R) + x_1z_1 + \cdots + x_Rz_R}\bigr)$,    (40)

averaged over all possibilities for $z_1\ldots z_R$. This is the sum of correct guesses
minus incorrect guesses, divided by $2^R$; so if $p$ is the probability that $G$ is correct,
we have $s = p - (1 - p)$, or $p = \frac12 + \frac12 s$.

For example, suppose $R = 4$ and $G(z_1z_2z_3z_4) = [z_1 \ne z_2][z_3 + z_4 < 2]$. This
function has success rate $s = \frac34$ (and $p = \frac78$) if $x = 1100$, because it equals
$x\cdot z \bmod 2 = (z_1 + z_2) \bmod 2$ for all 4-bit strings $z$ except 0111 or 1011. It also
has success rate $\frac14$ when $x$ = 0000, 0011, 1101, or 1110; so there are five plausible
possibilities for $x$. The other eleven $x$’s make $s \le 0$.
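
These success rates are easy to confirm by enumerating all sixteen strings z for
each candidate x. The sketch below is an illustration with hypothetical function
names; it evaluates (40) in exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

def G(z):
    """The example guesser [z1 != z2][z3 + z4 < 2]."""
    z1, z2, z3, z4 = z
    return int(z1 != z2 and z3 + z4 < 2)

def success_rate(guess, x):
    """s = E((-1)^(guess(z) + x.z)), averaged over all z, as in (40)."""
    R = len(x)
    total = 0
    for z in product((0, 1), repeat=R):
        xz = sum(xi * zi for xi, zi in zip(x, z)) % 2
        total += (-1) ** (guess(z) + xz)
    return Fraction(total, 2 ** R)

rates = {"".join(map(str, x)): success_rate(G, x)
         for x in product((0, 1), repeat=4)}
for x, s in rates.items():
    if s > 0:
        print(x, s)   # five lines: 0000 1/4, 0011 1/4, 1100 3/4, 1101 1/4, 1110 1/4
```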

The following algorithm magically discovers $x$ in most cases when $G$ is a
successful guesser in the sense just described. More precisely, the algorithm
constructs a short list that has a good chance of containing $x$.


Algorithm L (Amplification of linear guesses). Given a binary-valued function
$G(z_1\ldots z_R)$ and a positive integer $k$, this algorithm outputs a list of $2^k$ binary
sequences $x = x_1\ldots x_R$ with the property that $x$ is likely to be output when
$G(z_1\ldots z_R)$ is a good approximation to the function $(x_1z_1 + \cdots + x_Rz_R) \bmod 2$.


L1. [Construct a random matrix.] Generate random bits $B_{ij}$ for $1 \le i \le k$ and
    $1 \le j \le R$.


L2. [Compute signs.] For $1 \le i \le R$, and for all bit strings $b = b_1\ldots b_k$, compute

        $h_i(b) = \sum_{c\ne 0} (-1)^{b\cdot c + G(cB + e_i)}$,    (42)

    where $e_i$ is the R-bit string $0\ldots 010\ldots 0$ having 1 in position $i$, and where $cB$
    is the string $d_1\ldots d_R$ with $d_j = (B_{1j}c_1 + \cdots + B_{kj}c_k) \bmod 2$. (In other words
    the binary vector $c_1\ldots c_k$ is multiplied by the $k \times R$ binary matrix $B$.) The
    sum is taken over all $2^k - 1$ bit strings $c_1\ldots c_k \ne 0\ldots 0$. It can be evaluated
    for each $i$ with $k\cdot 2^k$ additions and subtractions, using Yates’s method for
    the Hadamard transform; see the remarks following Eq. 4.6.4-(38).


L3. [Output the guesses.] For all $2^k$ choices of $b = b_1\ldots b_k$, output the string
    $x(b) = [h_1(b) < 0]\ldots[h_R(b) < 0]$. ▮
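
Steps L1 to L3 can be sketched directly in Python. This is an illustration, not a
tuned implementation: bit strings are packed into integers (bit $i-1$ of an
integer stands for position $i$), the sums $h_i(b)$ are evaluated naively in
about $4^k R$ operations instead of by Yates's method, and the names
`algorithm_L`, `dot` and the `rng` parameter are hypothetical.

```python
import random

def dot(u, v):
    """Inner product modulo 2 of two bit strings packed into integers."""
    return bin(u & v).count("1") & 1

def algorithm_L(G, R, k, rng=random):
    """Return the list of 2^k guesses x(b), each packed into an R-bit integer."""
    # L1: a random k x R bit matrix; rows[t] holds row t+1 as an R-bit integer.
    rows = [rng.getrandbits(R) for _ in range(k)]
    # cB for every k-bit vector c is the XOR of the rows selected by c.
    cB = [0] * (1 << k)
    for c in range(1, 1 << k):
        low = c & -c
        cB[c] = cB[c ^ low] ^ rows[low.bit_length() - 1]
    # L2: evaluate G(cB + e_i) once for each pair (c, i).
    g = [[G(cB[c] ^ (1 << i)) for i in range(R)] for c in range(1 << k)]
    guesses = []
    for b in range(1 << k):
        x = 0
        for i in range(R):
            h = sum(1 - 2 * (dot(b, c) ^ g[c][i]) for c in range(1, 1 << k))
            if h < 0:              # L3: bit i of x(b) is [h_i(b) < 0]
                x |= 1 << i
        guesses.append(x)
    return guesses

secret = 0b10110010
perfect_guesser = lambda z: dot(secret, z)     # G agrees with x.z everywhere
out = algorithm_L(perfect_guesser, R=8, k=4, rng=random.Random(1))
print(secret in out)   # True
```

When G is exactly linear, every term of $h_i(b)$ equals $(-1)^{x_i}$ for the
particular string b whose bits are the inner products of x with the rows of B, so
x always appears in the output regardless of the random matrix. For an imperfect
guesser the outcome depends on the random bits.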


To prove that Algorithm L works properly, we must show that a given
string $x$ will probably be output whenever it deserves to be. Notice first that
if we change $G$ to $G'$, where $G'(z) = (G(z) + z_j) \bmod 2$, the original $G(z)$ is
a good approximation to $x\cdot z \bmod 2$ if and only if the new $G'(z)$ is a good
approximation to $(x + e_j)\cdot z \bmod 2$, where $e_j$ is the unit-vector string defined in
step L2. Moreover, if we apply the algorithm to $G'$ instead of $G$, we get

    $h'_i(b) = \sum_{c\ne 0} (-1)^{b\cdot c + G(cB + e_i) + (cB + e_i)\cdot e_j} = (-1)^{\delta_{ij}}\, h_i((b + B_j) \bmod 2)$,

where $B_j$ is column $j$ of $B$. Therefore step L3 outputs the vectors $x'(b) =
x((b + B_j) \bmod 2) + e_j$, modulo 2. As $b$ runs through all k-bit strings, so does
$(b + B_j) \bmod 2$, and the effect is to complement bit $j$ of every $x$ in the output.

We need therefore prove only that the vector $x = 0\ldots 0$ is likely to be
output whenever $G(z)$ is a good approximation to the constant function 0. We
will show, in fact, that $x(0\ldots 0)$ equals $0\ldots 0$ in step L3 with high probability,
whenever $G(z)$ is a lot more likely to be 0 than 1 and $k$ is sufficiently large. More
precisely, the condition

    $\sum_{c\ne 0} (-1)^{G(cB + e_i)} > 0$

holds for $1 \le i \le R$ with probability $\ge \frac12$, if $s = E((-1)^{G(z)})$ is positive when
averaged over all $2^R$ possibilities for $z$ and if $k$ is large enough.

The key observation is that, for each fixed $c = c_1\ldots c_k \ne 0\ldots 0$, the string
$d = cB$ is uniformly distributed: Every value of $d$ occurs with probability $1/2^R$,
because the bits of $B$ are random. Furthermore, when $c \ne c' = c'_1\ldots c'_k$,
the strings $d = cB$ and $d' = c'B$ are independent: Every value of the pair
$(d,d')$ occurs with probability $1/2^{2R}$. Therefore we can argue as in the proof
of Chebyshev’s inequality that, for any fixed $i$, the sum $\sum_{c\ne 0} (-1)^{G(cB + e_i)}$ is
negative with probability at most $1/((2^k - 1)s^2)$. (Exercise 42 contains the
details.) It follows that $R/((2^k - 1)s^2)$ is an upper bound on the probability
that $x(0)$ is nonzero in step L3.


Theorem G. If $s = E((-1)^{G(z)+x\cdot z}) > 0$ and $2^k > \lceil 2R/s^2\rceil$, Algorithm L
outputs $x$ with probability $\ge \frac12$. The running time is $O(k2^kR)$ plus the time to
make $2^kR$ evaluations of $G$. ▮


Now we are ready to prove that the muddle-square sequence of Eq. 3.2.2-(17)
is a good source of (pseudo)random numbers. Suppose $2^{R-1} < M = PQ < 2^R$,
where $P$ and $Q$ are prime numbers of the form $4k + 3$ in the respective ranges
$2^{(R-2)/2} < P < 2^{(R-1)/2}$, $2^{R/2} < Q < 2^{(R+1)/2}$. We will call $M$ an R-bit
Blum integer, because the importance of such numbers for cryptography was first
pointed out by Manuel Blum [COMPCON 24 (Spring 1982), 133-137]. Blum
originally suggested that $P$ and $Q$ both have $R/2$ bits, but Algorithm 4.5.4D
shows that it is better to choose $P$ and $Q$ as stated here so that $Q - P > .29 \times 2^{R/2}$.

Choose $X_0$ at random in the range $0 < X_0 < M$, with $X_0 \perp M$ (that is, with
$X_0$ relatively prime to $M$); also let $Z$ be a random R-bit mask. We can construct
an iterative N-source $S$ by letting $X$ be the set of all $(x,z,m)$ that are
possibilities for $(X_0,Z,M)$, with the further restriction that $x \equiv a^2$ (modulo $m$)
for some $a$. The function $g(x,z,m) = (x^2 \bmod m, z, m)$ is easily shown to be a
permutation of $X$ (see, for example, exercise 4.5.4-35). The function $f(x,z,m)$
that extracts bits in this iterative source is $x\cdot z \bmod 2$. Our starting value
$(X_0,Z,M)$ isn’t necessarily in $X$, but $g(X_0,Z,M)$ is uniformly distributed in $X$,
because exactly four values of $X_0$ have a given square $X_0^2 \bmod M$.
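
The structure of this source can be explored with a toy modulus. The sketch below
is an illustration under stated assumptions: the primes 7 and 11 are far too small
for any real use and violate the size ranges given above, and the function name
`muddle_square_bits` is hypothetical. It generates bits by repeated squaring and
masking, and it checks the two facts used in the text: squaring permutes the
quadratic residues, and every such residue has exactly four square roots.

```python
from math import gcd

def muddle_square_bits(x0, z, m, n):
    """Return B_1..B_n, where X_j = X_{j-1}^2 mod m and B_j = (X_j . z) mod 2."""
    bits, x = [], x0
    for _ in range(n):
        x = x * x % m                             # g: square modulo m
        bits.append(bin(x & z).count("1") & 1)    # f: inner product with the mask
    return bits

P, Q = 7, 11          # toy primes of the form 4k + 3, for illustration only
M = P * Q
units = [x for x in range(1, M) if gcd(x, M) == 1]
residues = sorted({x * x % M for x in units})

print(muddle_square_bits(2, 1, M, 5))                      # [0, 0, 1, 1, 0]
print(sorted(x * x % M for x in residues) == residues)     # True: a permutation
print(len(units) // len(residues))                         # 4 square roots each
```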


Theorem P. Let $S$ be the N-source defined by the muddle-square method on
R-bit moduli, and suppose $S$ fails some statistical test $A$ with tolerance $\epsilon \ge 1/2^N$.
Then we can construct an algorithm $F$ that finds factors of random R-bit Blum
integers $M = PQ$ having the form described above, with success probability at
least $\epsilon/(4N)$ and with running time $T(F) = O(N^2R^2\epsilon^{-2}T(A) + N^3R^4\epsilon^{-2})$.


Proof. Multiplication mod $M$ can be done in $O(R^2)$ steps; hence $T(f) + T(g) =
O(R^2)$. Lemma P4 therefore asserts the existence of a guessing algorithm $G$
with success rate $\epsilon/N$ and $T(G) \le T(A) + O(NR^2)$. We can construct $G$ from $A$
using the method of exercise 41. This algorithm $G$ has the property that $s =
E((-1)^{G(y,z,m)+x\cdot z}) \ge (\frac12 + \epsilon/N) - (\frac12 - \epsilon/N) = 2\epsilon/N$, where the expected value
is taken over all $(x,z,m) \in X$, and where $(y,z,m) = g(x,z,m)$.

The desired algorithm $F$ proceeds as follows. Given a random $M = PQ$
with unknown $P$ and $Q$, it computes a random $X_0$ between 0 and $M$, and stops
immediately with a known factorization if $\gcd(X_0, M) \ne 1$. Otherwise it applies
Algorithm L with $G(z) = G(X_0^2 \bmod M, z, M)$ and $k = \lceil\lg(1 + 2N^2R/\epsilon^2)\rceil$. If
one of the $2^k$ values $x$ output by that algorithm satisfies $x^2 \equiv X_0^2$ (modulo $M$),
there is a 50:50 chance that $x \not\equiv \pm X_0$; then $\gcd(X_0 - x, M)$ and $\gcd(X_0 + x, M)$
are the prime factors of $M$. (See Rabin’s “SQRT box” in Section 4.5.4.)
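
The final gcd step is easy to see on a toy modulus. In the sketch below, which is
only an illustration, the values M = 77, $X_0$ = 2 and the function name
`split_from_roots` are assumptions; 9 and 75 are both square roots of 4 modulo 77,
but only the first is different from $\pm X_0$.

```python
from math import gcd

def split_from_roots(x0, x, m):
    """Given x^2 = x0^2 (mod m), return (gcd(x0 - x, m), gcd(x0 + x, m))."""
    if (x * x - x0 * x0) % m != 0:
        raise ValueError("x and x0 do not have the same square modulo m")
    return gcd(x0 - x, m), gcd(x0 + x, m)

print(split_from_roots(2, 9, 77))    # (7, 11): a root other than +-2 splits M
print(split_from_roots(2, 75, 77))   # (1, 77): the root -2 gives nothing new
```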

The running time of this algorithm is clearly $O(N^2R^2\epsilon^{-2}T(A) + N^3R^4\epsilon^{-2})$,
since $\epsilon \ge 2^{-N}$. The probability that it succeeds in factoring $M$ can be esti-
mated as follows. Let $n = |X|/2^R$ be the number of choices of $(x,m)$, and
let $s_{xm} = 2^{-R}\sum (-1)^{G(x^2 \bmod m,\,z,\,m) + x\cdot z}$ summed over all R-bit numbers $z$; thus $s =
\sum_{x,m} s_{xm}/n \ge 2\epsilon/N$. Let $t$ be the number of $(x,m)$ such that $s_{xm} \ge \epsilon/N$. The
probability that our algorithm deals with such a pair $(x,m)$ is

    $\frac{t}{n} \ge \sum_{x,m} \frac{s_{xm}}{n}\,[s_{xm} \ge \epsilon/N] = \sum_{x,m} \frac{s_{xm}}{n}\,(1 - [s_{xm} < \epsilon/N])$
              $\ge \frac{2\epsilon}{N} - \sum_{x,m} \frac{s_{xm}}{n}\,[s_{xm} < \epsilon/N] \ge \frac{\epsilon}{N}$.

And in such a case it finds $x$ with probability $\ge \frac12$, by Theorem G, since we have
$2^k > \lceil 2R/s_{xm}^2\rceil$; so it finds a factor with probability $\ge \frac14$. ▮


What does Theorem P imply, from a practical standpoint? Our proof shows 
that the constant implied by the O is small; let us assume that the running 
time for factoring is at most 10(N?R?e~?T(A)+ N? Rte-?). Many of the world’s 
greatest mathematicians have worked on the problem of factoring large numbers, 
especially after factoring was shown to be highly relevant to cryptography in the 
late 1970s. Since they haven’t found a good solution, we have excellent reason 
to believe that factoring is hard; hence Theorem P will show that T(A) must be 
large on all algorithms that detect nonrandomness of muddle-square bits. 
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Long computations are conveniently measured in MIP-years, the number of 
instructions executed per Gregorian year by a machine that performs a million 
instructions per Gregorian second — namely 31,556,952,000,000 ~ 3.16 x 101°. In 
1995, the time to factor a number of 120 decimal digits (400 bits), using the most 
highly tuned algorithms, was more than 250 MIP-years. The most optimistic 
researchers who have worked on factorization would be astonished if an algorithm 
were discovered that requires only exp(R!/4(In R)°/) instructions as R > oo. 
But let us assume that such a breakthrough has been achieved, for at least a 
not-too-small fraction of the R-bit Blum integers M. Then we could factor many 
numbers of about 50000 bits (15000 digits) in 2 x 107° MIP-years. If we generate 
N = 1000 random bits by muddle-square with R = ao) and if we assume that 
all algorithms that are good enough to factor at least TaS of the 50000-bit Blum 
integers must run at least 2 x 107° MIP-years, Theorem P tells us that every 
such set of 1000 bits will pass all statistical tests for randomness whose running 
time T(A) is less than 70000 MIP-years: No such algorithm A will be able to 
distinguish such bits from a truly random sequence with probability $\ge \epsilon = \frac{1}{100}$.

Impressive? No. Such a result is hardly surprising, since we need to specify 
about 150000 truly random bits just to start up the muddle-square method with
$X_0$, $Z$, and $M$ when $R = 50000$. Of course we should be able to get 1000 random
bits back from such an investment! 
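
The figures in the 50000-bit example can be checked by hand. The following is only a sketch of the arithmetic, under these assumptions: the factoring cost is bounded by $10(N^2R^2\epsilon^{-2}T(A) + N^3R^4\epsilon^{-2})$, a Gregorian year has 31,556,952 seconds, and $N = 1000$, $R = 50000$, $\epsilon = 1/100$. All decimal values are rounded.

```latex
\begin{align*}
R^{1/4}(\ln R)^{3/4} &\approx 14.95 \times 5.966 \approx 89.21,\\
\exp(89.21) &\approx 5.5\times10^{38}\ \text{instructions}
  \approx \frac{5.5\times10^{38}}{3.16\times10^{13}}
  \approx 1.75\times10^{25}\ \text{MIP-years},\\
10\,N^2R^2\epsilon^{-2} &= 10\cdot10^{6}\cdot(2.5\times10^{9})\cdot10^{4}
  = 2.5\times10^{20},\\
T(A) &\gtrsim \frac{1.75\times10^{25}}{2.5\times10^{20}}
  \approx 7\times10^{4}\ \text{MIP-years}.
\end{align*}
```

The subtracted term $NR^2 = 2.5 \times 10^{12}$ instructions is under a tenth of a MIP-year, so it does not affect the rounded result.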

But in general, the formula becomes

$$T(A) \ge \frac{1}{100000}\,N^{-2}R^{-2}\exp\bigl(R^{1/4}(\ln R)^{3/4}\bigr) - NR^2,$$

under our conservative assumptions, when $\epsilon = \frac{1}{100}$; the $NR^2$ term is negligible
when R is large. So let’s set $R = 200000$ and $N = 10^{10}$. Then we get ten billion
pseudorandom muddle-bits from ≈ 3R = 600000 truly random bits, passing all
statistical tests that require fewer than $7.486 \times 10^{10}$ MIP-years = 74.86 gigaMIP-years.
With $R = 333333$ and $N = 10^{13}$ the computation time needed to detect
any statistical bias increases to 535 teraMIP-years.

The simple pseudorandom generator 3.2.2-(16), which avoids the random
mask Z, can also be shown to pass all polynomial-time tests for randomness if
factoring is intractable. (See exercise 4.5.4-43.) But the known performance
guarantees for the simpler method are somewhat weaker than for muddle-square;
currently they are $O(N^4R\epsilon^{-4}\log(NR\epsilon^{-1}))$ versus the $O(N^2R^2\epsilon^{-2})$ of Theorem P.

Everyone believes that there is no factoring algorithm for R-bit numbers
whose running time is polynomial in R. If that conjecture is true in a stronger
form, so that we cannot even factor $1/R^k$ of the R-bit Blum integers in
polynomial time for any fixed k, Theorem P proves that the muddle-square method
generates pseudorandom numbers that pass all polynomial-time statistical tests
for randomness.

Stating this another way: If you generate random bits with the muddle-square
method for suitably chosen N and R, you either get numbers that pass
all reasonable statistical tests, or you get fame and fortune for discovering a new
factorization algorithm.

G. Summary, history, and bibliography. We have defined several degrees
of randomness that a sequence might possess. 

An infinite sequence that is ∞-distributed satisfies a great many useful
properties that are expected of random sequences, and there is a rich theory
concerning ∞-distributed sequences. (The exercises below develop several important
properties of such sequences that have not been mentioned in the text.)
Definition R1 is therefore an appropriate basis for theoretical studies of randomness.

The concept of an ∞-distributed b-ary sequence was introduced in 1909 by
Émile Borel. He essentially defined the concept of an (m, k)-distributed sequence,
and showed that the b-ary representations of almost all real numbers are
(m, k)-distributed for all m and k. He called such numbers entirely normal to base b,
and he stated Theorem C informally without apparently realizing that it required
proof [Rendiconti Circ. Mat. Palermo 27 (1909), 247-271, §12].

The notion of an ∞-distributed sequence of real numbers, also called a
completely equidistributed sequence, first appeared in a note by N. M. Korobov 
in Doklady Akad. Nauk SSSR 62 (1948), 21-22. Korobov and several of his 
colleagues developed the theory of such sequences quite extensively in a series 
of papers during the 1950s. Completely equidistributed sequences were inde- 
pendently studied by Joel N. Franklin, Math. Comp. 17 (1963), 28-59, in a 
paper that is particularly noteworthy because it was inspired by the problem 
of random number generation. The book Uniform Distribution of Sequences by 
L. Kuipers and H. Niederreiter (New York: Wiley, 1974) is an extraordinarily 
complete source of information about the rich mathematical literature concerning 
k-distributed sequences of all kinds. 

We have seen, however, that ∞-distributed sequences need not be sufficiently
haphazard to qualify completely as “random.” Three definitions, R4,
R5, and R6, were formulated above to provide the additional conditions; and 
Definition R6, in particular, seems to be an appropriate way to define the concept 
of an infinite random sequence. It is a precise, quantitative statement that may 
well coincide with the intuitive idea of true randomness. 

Historically, the development of these definitions was primarily influenced 
by the quest of R. von Mises for a good definition of “probability.” In Math. 
Zeitschrift 5 (1919), 52-99, von Mises proposed a definition similar in spirit 
to Definition R5, although stated too strongly (like our Definition R3) so that 
no sequences satisfying the conditions could possibly exist. Many people no- 
ticed this discrepancy, and A. H. Copeland [Amer. J. Math. 50 (1928), 535- 
552] suggested weakening von Mises’s definition by substituting what he called
“admissible numbers” (or Bernoulli sequences). These are equivalent to
∞-distributed [0..1) sequences in which all entries $U_n$ have been replaced by 1
if $U_n < p$ or by 0 if $U_n \ge p$, for a given probability p. Thus Copeland was
essentially suggesting a return to Definition R1. Then Abraham Wald showed 
that it is not necessary to weaken von Mises’s definition so drastically, and he 
proposed substituting a countable set of subsequence rules. In an important 
paper [Ergebnisse eines math. Kolloquiums 8 (Vienna: 1937), 38-72], Wald 
essentially proved Theorem W, although he made the erroneous assertion that
the sequence constructed by Algorithm W also satisfies the stronger condition
that $\Pr(U_n \in A)$ = measure of A, for all Lebesgue measurable $A \subseteq [0..1)$. We
have observed that no sequence can satisfy this property. 

The concept of “computability” was still very much in its infancy when 
Wald wrote his paper, and A. Church [Bull. Amer. Math. Soc. 46 (1940), 130- 
135] showed how the precise notion of “effective algorithm” could be added to 
Wald’s theory to make his definitions completely rigorous. The extension to 
Definition R6 was due essentially to A. N. Kolmogorov [Sankhyā A25 (1963),
369-376], who proposed Definition Q2 for finite sequences at the same time. 
Another definition of randomness for finite sequences, somewhere “between” Def- 
initions Q1 and Q2, had been formulated many years earlier by A. S. Besicovitch 
[Math. Zeitschrift 39 (1934), 146-156]. 

The publications of Church and Kolmogorov considered only binary sequences
for which $\Pr(X_n = 1) = p$ for a given probability p. Our discussion
in this section has been slightly more general, since a [0..1) sequence essentially
represents all p at once. The von Mises–Wald–Church definition has been refined
in yet another interesting way by J. V. Howard, Zeitschr. für math. Logik und 
Grundlagen der Math. 21 (1975), 215-224. 

Another important contribution was made by Donald W. Loveland [Zeitschr. 
für math. Logik und Grundlagen der Math. 12 (1966), 279-294], who discussed 
Definitions R4, R5, R6, and several intermediate concepts. Loveland proved that 
there are R5-random sequences that do not satisfy R4, thereby establishing the 
need for a stronger definition such as R6. In fact, he defined a rather simple 
permutation $\langle f(n)\rangle$ of the nonnegative integers, and an Algorithm W′ analogous
to Algorithm W, such that

$$\overline{\Pr}(U_{f(n)} \ge \tfrac12) - \underline{\Pr}(U_{f(n)} \ge \tfrac12) \ge \tfrac12$$

for every R5-random sequence $\langle U_n\rangle$ produced by Algorithm W′ when it is given
an infinite set of subsequence rules $\mathcal{R}_k$.

Although Definition R6 is intuitively much stronger than R4, it is apparently 
not a simple matter to prove this rigorously, and for several years it was an open 
question whether or not R4 implies R6. Finally Thomas Herzog and James C. 
Owings, Jr., discovered how to construct a large family of sequences that satisfy 
R4 but not R6. [See Zeitschr. für math. Logik und Grundlagen der Math. 22
(1976), 385-389.] 

Kolmogorov wrote another significant paper [Problemy Peredači Informatsii
1 (1965), 3-11] in which he considered the problem of defining the “information
content” of a sequence, and this work led to Chaitin and Martin-Löf’s interesting
definition of finite random sequences via “patternlessness.” [See IEEE Trans.
IT-14 (1968), 662-664.] The ideas can also be traced to R. J. Solomonoff,
Information and Control 7 (1964), 1-22, 224-254; IEEE Trans. IT-24 (1978), 
422-432; J. Computer and System Sciences 55 (1997), 73-88. 

For a philosophical discussion of random sequences, see K. R. Popper, The 
Logic of Scientific Discovery (London, 1959), especially the interesting
construction on pages 162-163, which he first published in 1934.

Further connections between random sequences and recursive function
theory have been explored by D. W. Loveland, Trans. Amer. Math. Soc. 125
(1966), 497-510. See also C.-P. Schnorr [Zeitschr. Wahr. verw. Geb. 14 (1969), 
27-35], who found strong relations between random sequences and the “species 
of measure zero” defined by L. E. J. Brouwer in 1919. Schnorr’s subsequent 
book Zufälligkeit und Wahrscheinlichkeit [Lecture Notes in Math. 218 (Berlin:
Springer, 1971)] gives a detailed treatment of the entire subject of randomness 
and makes an excellent introduction to the ever-growing advanced literature on 
the topic. Important developments during the next two decades are surveyed 
in An Introduction to Kolmogorov Complexity and Its Applications (Springer, 
1993), by Ming Li and Paul M. B. Vitányi. 

The foundations of the theory of pseudorandom sequences and effective 
information were laid by Manuel Blum, Silvio Micali, and Andrew Yao [FOCS 
23 (1982), 80-91, 112-117; SICOMP 13 (1984), 850-864], who constructed the 
first explicit sequences that pass all feasible statistical tests. Blum and Micali 
introduced the notion of a “hard-core bit,” a Boolean function f such that f(x) 
and g(x) are easily computed although $f(g^{[-1]}(x))$ is not; their paper was the
origin of Lemma P4. Leonid Levin developed the theory further [Combinatorica 
7 (1987), 357-363], then he and Oded Goldreich [STOC 21 (1989), 25-32] 
analyzed algorithms such as the muddle-square method and showed that similar 
use of a mask yields hard-core bits in many further cases. Finally Charles Rackoff 
refined the methods of that paper by introducing and analyzing Algorithm L [see 
L. Levin, J. Symbolic Logic 58 (1993), 1102-1103]. 

Many other authors have contributed to the theory — notably Impagliazzo, 
Levin, Luby, and Håstad, who showed [SICOMP 28 (1999), 1364-1396] that
pseudorandom sequences can be constructed from any one-way function — but 
such results are beyond the scope of this book. The practical implications of 
theoretical work on pseudorandomness were first investigated empirically by 
P. L’Ecuyer and R. Proulx, Proc. Winter Simulation Conf. 22 (1989), 467-476. 


If the numbers are not random, 
they are at least higgledy-piggledy. 


— GEORGE MARSAGLIA (1984) 


EXERCISES 

1. [10] Can a periodic sequence be equidistributed? 

2. [10] Consider the periodic binary sequence 0, 0, 1, 1, 0, 0, 1, 1, .... Is it 
1-distributed? Is it 2-distributed? Is it 3-distributed? 

3. [M22] Construct a periodic ternary sequence that is 3-distributed. 

4. [HM14] Prove that $\Pr(S(n) \text{ and } T(n)) + \Pr(S(n) \text{ or } T(n)) = \Pr(S(n)) + \Pr(T(n))$,
for any two statements S(n) and T(n), provided that at least three of the limits exist.
For example, if a sequence is 2-distributed, we would find that

$$\Pr(u_1 \le U_n < v_1 \text{ or } u_2 \le U_{n+1} < v_2) = v_1 - u_1 + v_2 - u_2 - (v_1 - u_1)(v_2 - u_2).$$

> 5. [HM22] Let $U_n = (2^{\lfloor \lg(n+1) \rfloor}/3) \bmod 1$. What is $\Pr(U_n < \frac12)$?


6. [HM23] Let $S_1(n)$, $S_2(n)$, ... be an infinite sequence of statements about mutually
disjoint events; that is, $S_i(n)$ and $S_j(n)$ cannot simultaneously be true if $i \ne j$. Assume
that $\Pr(S_j(n))$ exists for each $j \ge 1$. Show that $\Pr(S_j(n) \text{ is true for some } j \ge 1) \ge \sum_{j \ge 1} \Pr(S_j(n))$,
and give an example to show that equality need not hold.


7. [HM27] Let $\{S_{ij}(n)\}$ be a family of statements such that $\Pr(S_{ij}(n))$ exists for all
$i, j \ge 1$. Assume that for all n > 0, $S_{ij}(n)$ is true for exactly one pair of integers i, j.
If $\sum_{i,j \ge 1} \Pr(S_{ij}(n)) = 1$, does it follow that “$\Pr(S_{ij}(n)$ is true for some $j \ge 1)$” exists
for all $i \ge 1$, and that it equals $\sum_{j \ge 1} \Pr(S_{ij}(n))$?

8. [M15] Prove (13). 

9. [HM20] Prove Lemma E. [Hint: Consider $(y_{jn} - \alpha)^2$.]

> 10. [HM22] Where was the fact that m divides q used in the proof of Theorem C? 

11. [M10] Use Theorem C to prove that if a sequence $\langle U_n\rangle$ is ∞-distributed, so is the
subsequence $\langle U_{2n}\rangle$.


12. [HM20] Show that a k-distributed sequence passes the “maximum-of-k test,” in
the following sense: $\Pr(u \le \max(U_n, U_{n+1}, \ldots, U_{n+k-1}) < v) = v^k - u^k$.

> 13. [HM27] Show that an ∞-distributed [0..1) sequence passes the “gap test” in the
following sense: If $0 \le \alpha < \beta \le 1$ and $p = \beta - \alpha$, let f(0) = 0, and for $n \ge 1$ let f(n)
be the smallest integer m > f(n − 1) such that $\alpha \le U_m < \beta$; then

$$\Pr(f(n) - f(n-1) = k) = p(1-p)^{k-1}.$$


14. [HM25] Show that an ∞-distributed sequence passes the “run test” in the following
sense: If f(0) = 0 and if, for $n \ge 1$, f(n) is the smallest integer m > f(n − 1) such
that $U_{m-1} > U_m$, then

$$\Pr(f(n) - f(n-1) = k) = 2k/(k+1)! - 2(k+1)/(k+2)!.$$

> 15. [HM30] Show that an ∞-distributed sequence passes the “coupon-collector’s test”
when there are only two kinds of coupons, in the following sense: Let $X_1, X_2, \ldots$ be
an ∞-distributed binary sequence. Let f(0) = 0, and for $n \ge 1$ let f(n) be the smallest
integer m > f(n − 1) such that $\{X_{f(n-1)+1}, \ldots, X_m\}$ is the set {0,1}. Prove that
$\Pr(f(n) - f(n-1) = k) = 2^{1-k}$, for $k \ge 2$. (See exercise 7.)


16. [HM38] Does the coupon-collector’s test hold for ∞-distributed sequences when
there are more than two kinds of coupons? (See the previous exercise.) 


17. [HM50] If r is any given rational number, Franklin has proved that the sequence
$\langle r^n \bmod 1\rangle$ is not 2-distributed. But is there any rational number r for which this
sequence is equidistributed? In particular, is the sequence equidistributed when $r = \frac32$?
[See K. Mahler, Mathematika 4 (1957), 122-124.]


> 18. [HM22] Prove that if $U_0, U_1, \ldots$ is k-distributed, so is the sequence $V_0, V_1, \ldots$,
where $V_n = \lfloor nU_n \rfloor / n$.


19. [HM35] Consider a modification of Definition R4 that requires the subsequences
to be only 1-distributed instead of ∞-distributed. Is there a sequence that satisfies
this weaker definition, but that is not ∞-distributed? (Is the weaker definition really
weaker?)

> 20. [HM36] (N. G. de Bruijn and P. Erdős.) The first n points of any [0..1) sequence
$\langle U_n\rangle$ with $U_0 = 0$ divide the interval [0..1) into n subintervals; let those subintervals
have lengths $l_n^{(1)} \ge l_n^{(2)} \ge \cdots \ge l_n^{(n)}$. Clearly $l_n^{(1)} \ge \frac1n \ge l_n^{(n)}$, because $l_n^{(1)} + \cdots + l_n^{(n)} = 1$.
One way to measure the equitability of the distribution of $\langle U_n\rangle$ is to consider

$$\bar{L} = \limsup_{n\to\infty} n\,l_n^{(1)} \quad\text{and}\quad \underline{L} = \liminf_{n\to\infty} n\,l_n^{(n)}.$$

a) What are $\bar{L}$ and $\underline{L}$ for van der Corput’s sequence (29)?
b) Show that $l_{n+k-1}^{(1)} \ge l_n^{(k)}$ for $1 \le k \le n$. Use this result to prove that $\bar{L} \ge 1/\ln 2$.


c) Prove that $\underline{L} \le 1/\ln 4$. [Hint: For each n there are numbers $a_1, \ldots, a_{2n}$ such that
$l_{2n}^{(k)} \ge l_{n+a_k}^{(n+a_k)}$ for $1 \le k \le 2n$. Moreover, each integer 2, ..., n occurs at most
twice in $\{a_1, \ldots, a_{2n}\}$.]


d) Show that the sequence $\langle W_n\rangle$ defined by $W_n = \lg(2n+1) \bmod 1$ satisfies
$1/\ln 2 > n\,l_n^{(1)} \ge n\,l_n^{(n)} > 1/\ln 4$ for all n; hence it achieves the optimum $\bar{L}$ and $\underline{L}$.

21. [HM40] (L. H. Ramshaw.)
a) Continuing the previous exercise, is the sequence $\langle W_n\rangle$ equidistributed?


b) Show that $\langle W_n\rangle$ is the only [0..1) sequence for which we have $\sum_{j=1}^{k} l_n^{(j)} \le \lg(1 + k/n)$
whenever $1 \le k \le n$.


c) Let $\langle f_n(l_1, \ldots, l_n)\rangle$ be any sequence of continuous functions on the sets of n-tuples
$\{(l_1, \ldots, l_n) \mid l_1 \ge \cdots \ge l_n \text{ and } l_1 + \cdots + l_n = 1\}$, satisfying the following two
properties: 

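$$f_{mn}(\tfrac1m l_1, \ldots, \tfrac1m l_1, \tfrac1m l_2, \ldots, \tfrac1m l_2, \ldots, \tfrac1m l_n, \ldots, \tfrac1m l_n) = f_n(l_1, \ldots, l_n);$$

$$\text{if } \sum_{j=1}^{k} l_j \ge \sum_{j=1}^{k} l'_j \text{ for } 1 \le k \le n, \text{ then } f_n(l_1, \ldots, l_n) \ge f_n(l'_1, \ldots, l'_n).$$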
[Examples are: nl, —nl™, W, nie? p-e ii) 1 Let 


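$$F = \limsup_{n\to\infty} f_n(l_n^{(1)}, \ldots, l_n^{(n)})$$
for the sequence $\langle W_n\rangle$. Show that $f_n(l_n^{(1)}, \ldots, l_n^{(n)}) \le F$ for all n, with respect to
$\langle W_n\rangle$; also $\limsup_{n\to\infty} f_n(l_n^{(1)}, \ldots, l_n^{(n)}) \ge F$ with respect to every other [0..1)
sequence.

> 22. [HM30] (Hermann Weyl.) Show that the [0..1) sequence $\langle U_n\rangle$ is k-distributed if
and only if

$$\lim_{N\to\infty} \frac1N \sum_{0 \le n < N} \exp\bigl(2\pi i(c_1U_n + \cdots + c_kU_{n+k-1})\bigr) = 0$$

for every set of integers $c_1, c_2, \ldots, c_k$ not all zero.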


23. [M32] (a) Show that a [0..1) sequence $\langle U_n\rangle$ is k-distributed if and only if all of
the sequences $\langle (c_1U_n + c_2U_{n+1} + \cdots + c_kU_{n+k-1}) \bmod 1\rangle$ are 1-distributed, whenever $c_1$,
$c_2$, ..., $c_k$ are integers not all zero. (b) Show that a b-ary sequence $\langle X_n\rangle$ is k-distributed
if and only if all of the sequences $\langle (c_1X_n + c_2X_{n+1} + \cdots + c_kX_{n+k-1}) \bmod b\rangle$ are
1-distributed, whenever $c_1, c_2, \ldots, c_k$ are integers with $\gcd(c_1, \ldots, c_k) = 1$.

> 24. [M35] (J. G. van der Corput.) (a) Prove that the [0..1) sequence $\langle U_n\rangle$ is equidistributed
whenever the sequences $\langle (U_{n+k} - U_n) \bmod 1\rangle$ are equidistributed for all k > 0.
(b) Consequently $\langle (\alpha_d n^d + \cdots + \alpha_1 n + \alpha_0) \bmod 1\rangle$ is equidistributed, when d > 0 and
$\alpha_d$ is irrational.

25. [HM20] A sequence is called a “white sequence” if all serial correlations are zero;
that is, if the equation in Corollary S is true for all $k \ge 1$. (By Corollary S, an
∞-distributed sequence is white.) Show that if a [0..1) sequence is equidistributed, it is
white if and only if

$$\lim_{n\to\infty} \frac1n \sum_{0 \le j < n} (U_j - \tfrac12)(U_{j+k} - \tfrac12) = 0, \quad\text{for all } k \ge 1.$$


26. [HM34] (J. Franklin.) A white sequence, as defined in the previous exercise, can
definitely fail to be random. Let $U_0, U_1, \ldots$ be an ∞-distributed sequence, and define
the sequence $V_0, V_1, \ldots$ as follows:

$$(V_{2n-1}, V_{2n}) = (U_{2n-1}, U_{2n}) \quad\text{if } (U_{2n-1}, U_{2n}) \in G,$$
$$(V_{2n-1}, V_{2n}) = (U_{2n}, U_{2n-1}) \quad\text{if } (U_{2n-1}, U_{2n}) \notin G,$$

where G is the set $\{(x, y) \mid x - \frac12 \le y \le x \text{ or } x + \frac12 \le y\}$.

Show that (a) $V_0, V_1, \ldots$ is equidistributed and white; (b) $\Pr(V_n > V_{n+1}) = \frac58$. (This
points out the weakness of the serial correlation test.)


27. [HM48] What is the highest possible value for $\Pr(V_n > V_{n+1})$ in an equidistributed,
white sequence? (D. Coppersmith has constructed such a sequence achieving the
value $\frac78$.)


28. [HM21] Use the sequence (11) to construct a [0..1) sequence that is 3-distributed,
for which $\Pr(U_{2n} \ge \frac12) = \frac34$.


29. [HM34] Let $X_0, X_1, \ldots$ be a (2k)-distributed binary sequence. Show that

$$\overline{\Pr}(X_{2n} = 0) \le \frac12 + \binom{2k-1}{k}\Big/2^{2k}.$$

30. [M39] Construct a binary sequence that is (2k)-distributed, and for which

$$\Pr(X_{2n} = 0) = \frac12 + \binom{2k-1}{k}\Big/2^{2k}.$$

(Therefore the inequality in the previous exercise is the best possible.)


31. [M30] Show that [0..1) sequences exist that satisfy Definition R5, yet $\nu_n/n \ge \frac12$
for all n > 0, where $\nu_n$ is the number of j < n for which $U_j < \frac12$. (This might be
considered a nonrandom property of the sequence.)


32. [M24] Given that (Xn) is a “random” b-ary sequence according to Definition R5, 
and that R is a computable subsequence rule that specifies an infinite subsequence 
(Xn)R, show that the latter subsequence is not only 1-distributed, it is “random” by 
Definition R5. 


33. [HM22] Let $\langle U_{r_n}\rangle$ and $\langle U_{s_n}\rangle$ be infinite disjoint subsequences of a sequence $\langle U_n\rangle$.
(Thus, $r_0 < r_1 < r_2 < \cdots$ and $s_0 < s_1 < s_2 < \cdots$ are increasing sequences of integers
and $r_m \ne s_n$ for any m, n.) Let $\langle U_{t_n}\rangle$ be the combined subsequence, so that $t_0 < t_1 <
t_2 < \cdots$ and the set $\{t_n\} = \{r_n\} \cup \{s_n\}$. Show that if $\Pr(U_{r_n} \in A) = \Pr(U_{s_n} \in A) = p$,
then $\Pr(U_{t_n} \in A) = p$.

34. [M25] Define subsequence rules $\mathcal{R}_1$, $\mathcal{R}_2$, $\mathcal{R}_3$, ... such that Algorithm W can be
used with these rules to give an effective algorithm to construct a [0..1) sequence
satisfying Definition R1.


> 35. [HM35] (D. W. Loveland.) Show that if a binary sequence $\langle X_n\rangle$ is R5-random,
and if $\langle s_n\rangle$ is any computable sequence as in Definition R4, then $\overline{\Pr}(X_{s_n} = 1) \ge \frac12$ and
$\underline{\Pr}(X_{s_n} = 1) \le \frac12$.

36. [HM30] Let $\langle X_n\rangle$ be a binary sequence that is “random” according to
Definition R6. Show that the [0..1) sequence $\langle U_n\rangle$ defined in binary notation by the scheme

$$U_0 = (0.X_0)_2,\quad U_1 = (0.X_1X_2)_2,\quad U_2 = (0.X_3X_4X_5)_2,\quad U_3 = (0.X_6X_7X_8X_9)_2,\quad \ldots$$


is random in the sense of Definition R6. 


37. [M37] (D. Coppersmith.) Define a sequence that satisfies Definition R4 but not
Definition R5. [Hint: Consider changing $U_0$, $U_1$, $U_4$, $U_9$, ... in a truly random sequence.]


38. [M49] (A. N. Kolmogorov.) Given N, n, and $\epsilon$, what is the smallest number of
algorithms in a set A such that no $(n, \epsilon)$-random binary sequences of length N exist
with respect to A? (If exact formulas cannot be given, can asymptotic formulas be 
found? The point of this problem is to discover how close the bound (37) comes to 
being “best possible.”) 

39. [HM45] (W. M. Schmidt.) Let $\langle U_n\rangle$ be a [0..1) sequence, and let $\nu_n(u)$ be the
number of nonnegative integers j < n such that $0 \le U_j < u$. Prove that there is a
positive constant c such that, for any N and for any [0..1) sequence $\langle U_n\rangle$, we have

$$|\nu_n(u) - un| > c \ln N$$

for some n and u with $0 \le n < N$, $0 \le u < 1$. (In other words, no [0..1) sequence can
be too equidistributed.)


40. [M28] Complete the proof of Lemma P1. 


41. [M21] Lemma P2 shows the existence of a prediction test, but its proof relies on 
the existence of a suitable k without explaining how we could find k constructively 
from A. Show that any algorithm A can be converted into an algorithm A′ with
$T(A') \le T(A) + O(N)$ that predicts $B_N$ from $B_1 \ldots B_{N-1}$ with probability at least
$\frac12 + (P(A, S) - P(A, \$_N))/N$ on any shift-symmetric N-source S.

> 42. [M28] (Pairwise independence.)
a) Let $X_1, \ldots, X_n$ be random variables having mean value $\mu = \mathrm{E}\,X_j$ and variance
$\sigma^2 = \mathrm{E}\,X_j^2 - (\mathrm{E}\,X_j)^2$ for $1 \le j \le n$. Prove Chebyshev’s inequality

$$\Pr\bigl((X_1 + \cdots + X_n - n\mu)^2 \ge tn\sigma^2\bigr) \le 1/t,$$

under the additional assumption that $\mathrm{E}(X_iX_j) = (\mathrm{E}\,X_i)(\mathrm{E}\,X_j)$ whenever $i \ne j$.
b) Let B be a random k × R binary matrix. Prove that if c and c′ are fixed nonzero
k-bit vectors, with c ≠ c′, the vectors cB and c′B are independent random R-bit
vectors (modulo 2). 
c) Apply (a) and (b) to the analysis of Algorithm L. 


43. [20] It seems just as difficult to find the factors of any fixed R-bit Blum integer M 
as to find the factors of a random R-bit integer. Why then is Theorem P stated for 
random M instead of fixed M?


> 44. [16] (I. J. Good.) Can a valid table of random digits contain just one misprint?

3.6. SUMMARY


WE HAVE COVERED a fairly large number of topics in this chapter: How to 
generate random numbers, how to test them, how to modify them in applications, 
and how to derive theoretical facts about them. Perhaps the main question in 
many readers’ minds will be, “What is the result of all this theory? What is 
a simple, virtuous generator that I can use in my programs in order to have a 
reliable source of random numbers?” 

The detailed investigations in this chapter suggest that the following proce- 
dure gives the simplest random number generator for the machine language of 
most computers: At the beginning of the program, set an integer variable X to 
some value Xo. This variable X is to be used only for the purpose of random 
number generation. Whenever a new random number is required by the program, 
set

$$X \leftarrow (aX + c) \bmod m \tag{1}$$

and use the new value of X as the random value. It is necessary to choose $X_0$,
a, c, and m properly, and to use the random numbers wisely, according to the 
following principles: 


i) The “seed” number $X_0$ may be chosen arbitrarily. If the program is run
several times and a different source of random numbers is desired each
time, set $X_0$ to the last value attained by X on the preceding run; or (if
more convenient) set $X_0$ to the current date and time. If the program may
need to be rerun later with the same random numbers (for example, when
debugging), be sure to print out $X_0$ if it isn’t otherwise known.

ii) The number m should be large, say at least $2^{30}$. It may conveniently be
taken as the computer’s word size, since this makes the computation of
$(aX + c) \bmod m$ quite efficient. Section 3.2.1.1 discusses the choice of m
in more detail. The computation of $(aX + c) \bmod m$ must be done exactly,
with no roundoff error.

iii) If m is a power of 2 (that is, if a binary computer is being used), pick a 
so that a mod 8 = 5. If m is a power of 10 (that is, if a decimal computer
is being used), choose a so that a mod 200 = 21. This choice of a together 
with the choice of c given below ensures that the random number generator 
will produce all m different possible values of X before it starts to repeat 
(see Section 3.2.1.2) and ensures high “potency” (see Section 3.2.1.3). 

iv) The multiplier a should preferably be chosen between .01m and .99m, and 
its binary or decimal digits should not have a simple, regular pattern. By 
choosing some haphazard constant like a = 3141592621 (which satisfies 
both of the conditions in (iii)), one almost always obtains a reasonably good 
multiplier. Further testing should of course be done if the random number 
generator is to be used extensively; for example, there should be no large 
quotients when Euclid’s algorithm is used to find the gcd of a and m (see 
Section 3.3.3). The multiplier should pass the spectral test (Section 3.3.4) 
and several tests of Section 3.3.2, before it is considered to have a truly clean 
bill of health.

v) The value of c is immaterial when a is a good multiplier, except that c must
have no factor in common with m when m is the computer’s word size. Thus
we may choose c = 1 or c = a. (People who use c = 0 together with $m = 2^e$
are sacrificing two bits of accuracy and half of the seed values just to save a
few nanoseconds of running time; see exercise 3.2.1.2-9.)

vi) The least significant (right-hand) digits of X are not very random, so de- 
cisions based on the number X should always be influenced primarily by 
the most significant digits. It is generally best to think of X as a random 
fraction X/m between 0 and 1, that is, to visualize X with a radix point at 
its left, rather than to regard X as a random integer between 0 and m — 1. 
To compute a random integer between 0 and k— 1, one should multiply by k 
and truncate the result. (Don’t divide by k; see exercise 3.4.1-3.) 

vii) An important limitation on the randomness of sequence (1) is discussed in 
Section 3.3.4, where it is shown that the “accuracy” in t dimensions will 
be only about one part in $\sqrt[t]{m}$. Monte Carlo applications requiring higher
resolution can improve the randomness by employing techniques discussed 
in Section 3.2.2. 


viii) At most about m/1000 numbers should be generated; otherwise the future 
will behave more and more like the past. If $m = 2^{32}$, this means that a new
scheme (for example, a new multiplier a) should be adopted after every few 
million random numbers are consumed. 
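
Principles (ii), (iii), (v), and (vi) can be combined in a few lines of C. The following is a minimal sketch, not a recommended generator: it assumes a 64-bit word, so that $m = 2^{64}$ and unsigned overflow performs the reduction mod m exactly. The names `lcg_next` and `lcg_below` are hypothetical, and the two constants are an assumption of this sketch, chosen only because the multiplier satisfies a mod 8 = 5 and the increment is odd; a multiplier should still be tested as described in (iv) before serious use.

```c
#include <stdint.h>

/* Illustrative constants (an assumption of this sketch, not prescribed
   by the text): LCG_A mod 8 = 5, and LCG_C is odd, hence prime to 2^64. */
#define LCG_A 6364136223846793005ULL
#define LCG_C 1442695040888963407ULL

/* One step of X <- (aX + c) mod 2^64; unsigned arithmetic wraps exactly. */
static uint64_t lcg_next(uint64_t *x) {
    *x = LCG_A * (*x) + LCG_C;
    return *x;
}

/* Random integer in 0..k-1 taken from the high-order bits, as in (vi):
   regard the top 32 bits as a fraction, multiply by k, and truncate. */
static uint32_t lcg_below(uint64_t *x, uint32_t k) {
    uint64_t hi = lcg_next(x) >> 32;     /* most significant 32 bits */
    return (uint32_t)((hi * k) >> 32);   /* floor(k * hi / 2^32) */
}
```

A caller would keep one `uint64_t` state variable, set it to the seed $X_0$, and call `lcg_below(&state, 6)` to simulate a die, for example. Taking `state % 6` instead would let the less random low-order bits decide the outcome.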


The comments above apply primarily to machine-language coding. Some of 
the ideas work fine also in higher-level languages for programming; for example, 
(1) becomes just ‘X=a*X+c’ in the C language, if X is of type unsigned long and 
if m is the modulus of unsigned long arithmetic (usually $2^{32}$ or $2^{64}$). But C
gives us no good way to regard X as a fraction, as required in (vi) above, unless 
we convert to double-precision floating point numbers. 

Another variant of (1) is therefore often used in languages like C: We choose 
m to be a prime number near the largest easily computed integer, and we let a 
be a primitive root of m; the appropriate increment c for this case is zero. Then 
(1) can be implemented entirely with simple arithmetic on numbers that remain 
between —m and +m, using the technique of exercise 3.2.1.1-9. For example, 
when a = 48271 and $m = 2^{31} - 1$ (see line 20 of Table 3.3.4-1), we can compute
$X \leftarrow aX \bmod m$ with the C code


#define MM 2147483647 /* a Mersenne prime */
#define AA 48271 /* this does well in the spectral test */
#define QQ 44488 /* MM / AA */
#define RR 3399 /* MM % AA; it is important that RR<QQ */
X=AA*(X%QQ)-RR*(X/QQ);
if (X<0) X+=MM;


here X is type long, and X should be initialized to a nonzero seed value less 
than MM. Since MM is prime, the least-significant bits of X are just as random as 
the most-significant bits, so the precautions of (vi) no longer need to be taken. 
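
One way to gain confidence in the quotient-and-remainder decomposition is to compare it against a direct computation with a double-width product. The sketch below assumes that a 64-bit type `uint64_t` is available; the function names `next_small` and `next_wide` are hypothetical. The first function repeats the method shown above, and the second serves only as a reference.

```c
#include <stdint.h>

#define MM 2147483647L   /* 2^31 - 1 */
#define AA 48271L
#define QQ 44488L        /* MM / AA */
#define RR 3399L         /* MM % AA */

/* The decomposition from the text: every intermediate value stays
   strictly between -MM and +MM, so 32-bit signed arithmetic suffices. */
static long next_small(long x) {
    x = AA * (x % QQ) - RR * (x / QQ);
    if (x < 0) x += MM;
    return x;
}

/* Reference computation using a 64-bit product (assumed available). */
static long next_wide(long x) {
    return (long)(((uint64_t)AA * (uint64_t)x) % (uint64_t)MM);
}
```

Iterating both functions from the same nonzero seed should give identical sequences; any disagreement would point to a wrong value of QQ or RR.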

If you need millions and millions of random numbers, you can combine that
routine with another, as in Eq. 3.3.4-(38), by writing some additional code: 


#define MMM 2147483399 /* a non-Mersenne prime */
#define AAA 40692 /* another spectral success story */
#define QQQ 52774 /* MMM / AAA */
#define RRR 3791 /* MMM % AAA; again less than QQQ */

Y=AAA*(Y%QQQ)-RRR*(Y/QQQ);
if (Y<0) Y+=MMM;
Z=X-Y; if (Z<=0) Z+=MM-1;


Like X, the variable Y needs to be initially nonzero. This code deviates slightly 
from 3.3.4-(38) so that the output, Z, always lies strictly between 0 and $2^{31} - 1$,
as recommended by Liviu Lalescu. The period length of the Z sequence is about 
74 quadrillion, and its numbers now have about twice as many bits of accuracy 
as the X numbers do. 
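
The two fragments can be packaged into a single routine, and the stated period can be checked numerically. The following sketch uses a hypothetical state type `struct combo` and hypothetical function names. It assumes that both multipliers are primitive roots of their moduli (the text says so for the first; for the second this is an assumption here), in which case the pair (X, Y) repeats after lcm(MM − 1, MMM − 1) steps.

```c
#include <stdint.h>

#define MM  2147483647L
#define AA  48271L
#define QQ  44488L
#define RR  3399L
#define MMM 2147483399L
#define AAA 40692L
#define QQQ 52774L
#define RRR 3791L

struct combo { long x, y; };   /* hypothetical state; both must be nonzero */

/* Advance both generators and return Z, which lies strictly between 0 and MM. */
static long combo_next(struct combo *s) {
    long z;
    s->x = AA * (s->x % QQ) - RR * (s->x / QQ);
    if (s->x < 0) s->x += MM;
    s->y = AAA * (s->y % QQQ) - RRR * (s->y / QQQ);
    if (s->y < 0) s->y += MMM;
    z = s->x - s->y;
    if (z <= 0) z += MM - 1;
    return z;
}

/* Euclid's algorithm on 64-bit unsigned values. */
static uint64_t gcd64(uint64_t a, uint64_t b) {
    while (b) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

/* lcm(MM-1, MMM-1): joint period of (X, Y) under the assumption above. */
static uint64_t combo_period(void) {
    uint64_t p = MM - 1, q = MMM - 1;
    return p / gcd64(p, q) * q;
}
```

Here `gcd64(MM - 1, MMM - 1)` is 62, so `combo_period()` is a little over $7.4 \times 10^{16}$, in agreement with the figure of about 74 quadrillion.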

This method is portable and fairly simple, but not very fast. An alternative 
scheme based on lagged Fibonacci sequences with subtraction (exercise 3.2.2- 
23) is even more attractive, because it not only allows easy portability between 
computers, it is considerably faster, and it delivers random numbers of better 
quality because the t-dimensional accuracy is probably good for t < 100. Here
is a C subroutine ran_array(long aa[], int n) that generates n new random 
numbers and places them into a given array aa, using the recurrence 


$$X_j = (X_{j-100} - X_{j-37}) \bmod 2^{30}. \tag{2}$$


This recurrence is particularly well suited to modern computers. The value of n 
must be at least 100; larger values like 1000 are recommended. 


#define KK 100 /* the long lag */ 
#define LL 37 /* the short lag */ 
#define MM (1L<<30) /* the modulus */ 
#define mod_diff(x,y) (((x)-(y))&(MM-1)) /* (x-y) mod MM */ 
long ran_x[KK]; /* the generator state */ 


void ran_array(long aa[],int n) { /* put n new values in aa */ 
  register int i,j;
  for (j=0;j<KK;j++) aa[j]=ran_x[j];
  for (;j<n;j++) aa[j]=mod_diff(aa[j-KK],aa[j-LL]);
  for (i=0;i<LL;i++,j++) ran_x[i]=mod_diff(aa[j-KK],aa[j-LL]);
  for (;i<KK;i++,j++) ran_x[i]=mod_diff(aa[j-KK],ran_x[i-LL]);
} 
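
Recurrence (2) can also be stepped one value at a time with a circular buffer of the last 100 values. The sketch below is an illustration of the recurrence only and is not the ran_array routine: the type `struct lagfib`, the function names, and the seeding procedure are all hypothetical, and the naive seeding (a small auxiliary congruential generator with arbitrarily chosen constants, plus one entry forced odd) carries none of the guarantees of a carefully designed initialization. It uses unsigned 32-bit arithmetic so that the subtraction wraps in a well-defined way before masking.

```c
#include <stdint.h>

#define LAG_LONG  100
#define LAG_SHORT 37
#define MASK30    0x3FFFFFFFu   /* 2^30 - 1 */

struct lagfib {
    uint32_t buf[LAG_LONG];  /* the last 100 values; oldest is buf[pos] */
    int pos;
};

/* Naive seeding, for illustration only: fill the buffer from a small
   auxiliary generator and make sure the entries are not all even. */
static void lagfib_seed(struct lagfib *g, uint32_t seed) {
    for (int j = 0; j < LAG_LONG; j++) {
        seed = 1664525u * seed + 1013904223u;
        g->buf[j] = (seed >> 2) & MASK30;
    }
    g->buf[0] |= 1u;
    g->pos = 0;
}

/* Compute X[j] = (X[j-100] - X[j-37]) mod 2^30 and store it in place
   of X[j-100], which is no longer needed. */
static uint32_t lagfib_next(struct lagfib *g) {
    uint32_t old = g->buf[g->pos];                                     /* X[j-100] */
    uint32_t mid = g->buf[(g->pos + LAG_LONG - LAG_SHORT) % LAG_LONG]; /* X[j-37]  */
    uint32_t x = (old - mid) & MASK30;
    g->buf[g->pos] = x;
    g->pos = (g->pos + 1) % LAG_LONG;
    return x;
}
```

Because the whole state lives in one structure, a plain assignment such as `saved = g;` records a point in the sequence to which the computation can later return.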


All information about numbers that will be generated by future calls to 
ran_array appears in ran_x, so you can make a copy of that array in the midst
of a computation if you want to restart at the same point later without going 
all the way back to the beginning of the sequence. The tricky thing about using 
a recurrence like (2) is, of course, to get everything started properly in the first 
place, by setting up suitable values of $X_0, \ldots, X_{99}$. The following subroutine
ran_start(long seed) initializes the generator nicely when given any seed number
between 0 and $2^{30} - 3$ = 1,073,741,821 inclusive:


#define TT 70 /* guaranteed separation between streams */ 
#define is_odd(x) ((x)&1) /* the units bit of x */ 


void ran_start(long seed) { /* use this to set up ran_array */
  register int t,j;
  long x[KK+KK-1]; /* the preparation buffer */
  register long ss=(seed+2)&(MM-2);
  for (j=0;j<KK;j++) {
    x[j]=ss; /* bootstrap the buffer */
    ss<<=1; if (ss>=MM) ss-=MM-2; /* cyclic shift 29 bits */
  }
  x[1]++; /* make x[1] (and only x[1]) odd */
  for (ss=seed&(MM-1),t=TT-1; t; ) {
    for (j=KK-1;j>0;j--)
      x[j+j]=x[j], x[j+j-1]=0; /* "square" */
    for (j=KK+KK-2;j>=KK;j--)
      x[j-(KK-LL)]=mod_diff(x[j-(KK-LL)],x[j]),
      x[j-KK]=mod_diff(x[j-KK],x[j]);
    if (is_odd(ss)) { /* "multiply by z" */
      for (j=KK;j>0;j--) x[j]=x[j-1];
      x[0]=x[KK]; /* shift the buffer cyclically */
      x[LL]=mod_diff(x[LL],x[KK]);
    }
    if (ss) ss>>=1; else t--;
  }
  for (j=0;j<LL;j++) ran_x[j+KK-LL]=x[j];
  for (;j<KK;j++) ran_x[j-LL]=x[j];
  for (j=0;j<10;j++) ran_array(x,KK+KK-1); /* warm it up */
}


(This program incorporates improvements to the author’s original ran_start rou- 
tine, recommended by Richard Brent and Pedro Gimeno in November 2001.) 
The somewhat curious maneuverings of ran_start are explained in exercise 9, 
which proves that the sequences of numbers generated from different starting 
seeds are independent of each other: Every block of 100 consecutive values $X_n$,
$X_{n+1}$, ..., $X_{n+99}$ in the subsequent output of ran_array will be distinct from the
blocks that occur with another seed. (Strictly speaking, this is known to be true
only when $n < 2^{70}$; but there are fewer than $2^{55}$ nanoseconds in a year.) Several
processes can therefore start in parallel with different seeds and be sure that they 
are doing independent calculations; different groups of scientists working on a 
problem in different computer centers can be sure that they are not duplicating 
the work of others if they restrict themselves to disjoint sets of seeds. Thus, more 
than one billion essentially disjoint batches of random numbers are provided by 
the single routines ran_array and ran_start. And if that is not enough, you can 
replace the program parameters 100 and 37 by other values from Table 3.2.2-1.

These C routines use the bitwise-and operation ‘&’ for efficiency, so they are
not strictly portable unless the computer uses two’s complement representation 
for integers. Almost all modern computers are based on two’s complement 
arithmetic, but ‘&’ is not really necessary for this algorithm. Exercise 10 shows 
how to get exactly the same sequences of numbers in FORTRAN, using no such 
tricks. Although the programs illustrated here are designed to generate 30-bit 
integers, they are easily modified to generate random 52-bit fractions between 0 
and 1, on computers that have reliable floating point arithmetic; see exercise 11. 
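
The remark that ‘&’ is not really necessary can be illustrated with a small sketch. It is not the FORTRAN conversion of exercise 10; the names mod_diff_portable and count_mismatches are hypothetical, and the replacement assumes that both operands already lie in the range 0 ≤ x, y < MM, as they do inside ran_array.

```c
#include <assert.h>

#define MM (1L<<30)                       /* the modulus, as in the text */
#define mod_diff(x,y) (((x)-(y))&(MM-1))  /* the text's version, using '&' */

/* Hypothetical helper (not from the text): (x-y) mod MM without '&'.
   Requires 0 <= x < MM and 0 <= y < MM, so that x-y lies strictly
   between -MM and MM and one conditional addition suffices. */
long mod_diff_portable(long x, long y) {
  long d;
  assert(0 <= x && x < MM && 0 <= y && y < MM);
  d = x - y;
  return d < 0 ? d + MM : d;
}

/* Count disagreements between the two versions on a coarse grid of
   operands; the count is 0 on a two's complement machine. */
long count_mismatches(void) {
  long x, y, bad = 0;
  for (x = 0; x < MM; x += 7777777L)
    for (y = 0; y < MM; y += 5555555L)
      if (mod_diff(x, y) != mod_diff_portable(x, y)) bad++;
  return bad;
}
```

The comparison costs a branch where the mask costs none, which is presumably why the text prefers ‘&’ when two's complement arithmetic can be taken for granted.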

You may wish to include ran_array in a library of subroutines, or you may 
find that somebody else has already done so. One way to check whether an 
implementation of ran_array and ran_start conforms with the code above is to 
run the following rudimentary test program: 


#include <stdio.h>

int main() { register int m; long a[2009];
  ran_start(310952);
  for (m=0;m<2009;m++) ran_array(a,1009);
  printf("%ld\n", ran_x[0]);
  ran_start(310952);
  for (m=0;m<1009;m++) ran_array(a,2009);
  printf("%ld\n", ran_x[0]); return 0;
}


The printed output should be 995235265 (twice). 
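
The earlier remark that ran_x holds all information about future output can be tried out in the same spirit. The following sketch repeats the routines of this section so that it stands alone; the type ran_snapshot and the functions ran_save, ran_restore, and snapshot_replays are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <string.h>

#define KK 100                            /* the long lag */
#define LL 37                             /* the short lag */
#define MM (1L<<30)                       /* the modulus */
#define TT 70                             /* guaranteed separation between streams */
#define mod_diff(x,y) (((x)-(y))&(MM-1))  /* (x-y) mod MM */
#define is_odd(x) ((x)&1)                 /* the units bit of x */

long ran_x[KK];                           /* the generator state */

void ran_array(long aa[], int n) {        /* as in the text; needs n >= KK */
  int i, j;
  for (j=0;j<KK;j++) aa[j]=ran_x[j];
  for (;j<n;j++) aa[j]=mod_diff(aa[j-KK],aa[j-LL]);
  for (i=0;i<LL;i++,j++) ran_x[i]=mod_diff(aa[j-KK],aa[j-LL]);
  for (;i<KK;i++,j++) ran_x[i]=mod_diff(aa[j-KK],ran_x[i-LL]);
}

void ran_start(long seed) {               /* as in the text */
  int t, j;
  long x[KK+KK-1];
  long ss=(seed+2)&(MM-2);
  for (j=0;j<KK;j++) {
    x[j]=ss;
    ss<<=1; if (ss>=MM) ss-=MM-2;
  }
  x[1]++;
  for (ss=seed&(MM-1),t=TT-1; t; ) {
    for (j=KK-1;j>0;j--) x[j+j]=x[j], x[j+j-1]=0;
    for (j=KK+KK-2;j>=KK;j--)
      x[j-(KK-LL)]=mod_diff(x[j-(KK-LL)],x[j]),
      x[j-KK]=mod_diff(x[j-KK],x[j]);
    if (is_odd(ss)) {
      for (j=KK;j>0;j--) x[j]=x[j-1];
      x[0]=x[KK];
      x[LL]=mod_diff(x[LL],x[KK]);
    }
    if (ss) ss>>=1; else t--;
  }
  for (j=0;j<LL;j++) ran_x[j+KK-LL]=x[j];
  for (;j<KK;j++) ran_x[j-LL]=x[j];
  for (j=0;j<10;j++) ran_array(x,KK+KK-1);
}

/* Hypothetical snapshot type: a copy of the KK state words. */
typedef struct { long x[KK]; } ran_snapshot;

void ran_save(ran_snapshot *s)          { memcpy(s->x, ran_x, sizeof ran_x); }
void ran_restore(const ran_snapshot *s) { memcpy(ran_x, s->x, sizeof ran_x); }

/* Returns 1 if the values produced after a restore repeat the values
   that were produced right after the corresponding save. */
int snapshot_replays(long seed) {
  static long a[1009], b[1009];
  ran_snapshot s;
  ran_start(seed);
  ran_array(a, 1009);   /* advance some distance into the sequence */
  ran_save(&s);
  ran_array(a, 1009);   /* the values that follow the snapshot */
  ran_array(b, 200);    /* keep going, so the state moves on */
  ran_restore(&s);
  ran_array(b, 1009);   /* should reproduce a[] exactly */
  return memcmp(a, b, sizeof a) == 0;
}
```

A call such as snapshot_replays(310952) is expected to return 1. A snapshot taken between calls of ran_array is enough because the routine keeps no other state.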

Caution: The numbers generated by ran_array fail the birthday spacings 
test of Section 3.3.2J, and they have other deficiencies that sometimes show up 
in high-resolution simulations (see exercises 3.3.2-31 and 3.3.2-35). One way to 
avoid the birthday spacings problem is simply to use only half of the numbers 
(skipping the odd-numbered elements); but that doesn’t cure the other problems. 
An even better procedure is to follow Martin Lüscher’s suggestion, discussed in
Section 3.2.2: Use ran_array to generate, say, 1009 numbers, but use only the first 
100 of these. (See exercise 15.) This method has modest theoretical support and 
no known defects. Most users will not need such a precaution, but it is definitely 
less risky, and it allows a convenient tradeoff between randomness and speed. 
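
One possible way to package this discipline is sketched below; it is an illustration under stated assumptions rather than the answer to exercise 15. The routines of this section are repeated so that the sketch stands alone, and the names QUALITY, USED, ran_buf, ran_next, and ran_next_start are hypothetical.

```c
#include <assert.h>

#define KK 100                            /* the long lag */
#define LL 37                             /* the short lag */
#define MM (1L<<30)                       /* the modulus */
#define TT 70                             /* guaranteed separation between streams */
#define mod_diff(x,y) (((x)-(y))&(MM-1))  /* (x-y) mod MM */
#define is_odd(x) ((x)&1)                 /* the units bit of x */

long ran_x[KK];                           /* the generator state */

void ran_array(long aa[], int n) {        /* as in the text; needs n >= KK */
  int i, j;
  for (j=0;j<KK;j++) aa[j]=ran_x[j];
  for (;j<n;j++) aa[j]=mod_diff(aa[j-KK],aa[j-LL]);
  for (i=0;i<LL;i++,j++) ran_x[i]=mod_diff(aa[j-KK],aa[j-LL]);
  for (;i<KK;i++,j++) ran_x[i]=mod_diff(aa[j-KK],ran_x[i-LL]);
}

void ran_start(long seed) {               /* as in the text */
  int t, j;
  long x[KK+KK-1];
  long ss=(seed+2)&(MM-2);
  for (j=0;j<KK;j++) {
    x[j]=ss;
    ss<<=1; if (ss>=MM) ss-=MM-2;
  }
  x[1]++;
  for (ss=seed&(MM-1),t=TT-1; t; ) {
    for (j=KK-1;j>0;j--) x[j+j]=x[j], x[j+j-1]=0;
    for (j=KK+KK-2;j>=KK;j--)
      x[j-(KK-LL)]=mod_diff(x[j-(KK-LL)],x[j]),
      x[j-KK]=mod_diff(x[j-KK],x[j]);
    if (is_odd(ss)) {
      for (j=KK;j>0;j--) x[j]=x[j-1];
      x[0]=x[KK];
      x[LL]=mod_diff(x[LL],x[KK]);
    }
    if (ss) ss>>=1; else t--;
  }
  for (j=0;j<LL;j++) ran_x[j+KK-LL]=x[j];
  for (;j<KK;j++) ran_x[j-LL]=x[j];
  for (j=0;j<10;j++) ran_array(x,KK+KK-1);
}

#define QUALITY 1009   /* numbers generated per batch */
#define USED 100       /* numbers actually delivered from each batch */

static long ran_buf[QUALITY];
static int ran_buf_pos = USED;   /* USED means "batch exhausted" */

/* Seed the generator and force a fresh batch on the next request. */
void ran_next_start(long seed) {
  ran_start(seed);
  ran_buf_pos = USED;
}

/* Deliver one number; only the first USED of every QUALITY are used.
   ran_next_start must have been called first. */
long ran_next(void) {
  if (ran_buf_pos >= USED) {
    ran_array(ran_buf, QUALITY);
    ran_buf_pos = 0;
  }
  return ran_buf[ran_buf_pos++];
}
```

Changing QUALITY trades speed for randomness in the way the text describes: each delivered number costs roughly QUALITY/USED steps of the recurrence.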

A great deal is known about linear congruential sequences like (1), but 
comparatively little has yet been proved about the randomness properties of 
lagged Fibonacci sequences like (2). Both approaches seem to be reliable in 
practice, if they are used with the caveats already stated. 

When this chapter was first written in the late 1960s, a truly horrible random 
number generator called RANDU was commonly used on most of the world’s 
computers (see Section 3.3.4). The authors of many contributions to the science 
of random number generation have often been unaware that particular methods 
they were advocating would prove to be inadequate. A particularly noteworthy 
example was the experience of Alan M. Ferrenberg and his colleagues, reported 
in Physical Review Letters 69 (1992), 3382-3384: They tested their algorithms 
for a three-dimensional problem by considering first a related two-dimensional 
problem with a known answer, and discovered that supposedly super-quality
modern random number generators gave wrong results in the fifth decimal place.
By contrast, an old-fashioned run-of-the-mill linear congruential generator,
$X \leftarrow 16807X \bmod (2^{31}-1)$, worked fine. Perhaps further research will show that even
the random number generators recommended here are unsatisfactory; we hope 
this is not the case, but the history of the subject warns us to be cautious. The 
most prudent policy for a person to follow is to run each Monte Carlo program 
at least twice using quite different sources of random numbers, before taking 
the answers of the program seriously; this will not only give an indication of 
the stability of the results, it also will guard against the danger of trusting in a 
generator with hidden deficiencies. (Every random number generator will fail in 
at least one application.) 
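
The policy of running a computation twice with quite different sources can be sketched as follows. The first source is the linear congruential generator just mentioned; the second, a 32-bit xorshift generator with shifts 13, 17, 5, is our own choice for illustration and does not come from the text. The toy computation (estimating π from points in the unit square) and all function names are likewise assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* X <- 16807 X mod (2^31 - 1); the state must satisfy 0 < x < 2^31 - 1. */
uint32_t lcg16807(uint32_t x) {
  return (uint32_t)(((uint64_t)x * 16807u) % 2147483647u);
}

/* A structurally different source, chosen only for illustration:
   a 32-bit xorshift step.  The state must be nonzero. */
uint32_t xorshift32(uint32_t x) {
  x ^= x << 13;
  x ^= x >> 17;
  x ^= x << 5;
  return x;
}

typedef uint32_t (*step_fn)(uint32_t);

/* Estimate pi as 4 times the fraction of random points (u,v) of the unit
   square that fall inside the quarter circle.  `range` is the number by
   which a raw state is divided to obtain a fraction in [0,1). */
double estimate_pi(step_fn step, uint32_t seed, double range, long trials) {
  long hits = 0, t;
  uint32_t s = seed;
  assert(trials > 0);
  for (t = 0; t < trials; t++) {
    double u, v;
    s = step(s); u = s / range;
    s = step(s); v = s / range;
    if (u * u + v * v < 1.0) hits++;
  }
  return 4.0 * hits / trials;
}
```

One would call estimate_pi(lcg16807, seed, 2147483647.0, n) and estimate_pi(xorshift32, seed, 4294967296.0, n) and compare the answers. Agreement within the expected statistical error is reassuring; a discrepancy signals that at least one source is unsuitable for the problem at hand.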


Excellent bibliographies of the pre-1972 literature on random number gen- 
eration have been compiled by Richard E. Nance and Claude Overstreet, Jr., 
Computing Reviews 13 (1972), 495-508, and by E. R. Sowey, International 
Stat. Review 40 (1972), 355-371. The period 1972-1984 is covered by Sowey 
in International Stat. Review 46 (1978), 89-102; J. Royal Stat. Soc. A149 
(1986), 83-107. Subsequent developments are discussed by Shu Tezuka, Uniform 
Random Numbers (Boston: Kluwer, 1995). 

For a detailed study of the use of random numbers in numerical analysis, 
see J. M. Hammersley and D. C. Handscomb, Monte Carlo Methods (London: 
Methuen, 1964). This book shows that some numerical methods are enhanced 
by using numbers that are “quasirandom,” designed specifically for a certain 
purpose (not necessarily satisfying the statistical tests we have discussed). The 
origins of Monte Carlo methods for computers are discussed by N. Metropolis 
and R. Eckhardt in Stanislaw Ulam 1909-1984, a special issue of Los Alamos 
Science 15 (1987), 125-137. 

Every reader is urged to work exercise 6 in the following set of problems. 


Almost all good computer programs 
contain at least one random-number generator. 


— DONALD E. KNUTH, Seminumerical Algorithms (1969) 


EXERCISES 
1. [21] Write a MIX subroutine with the following characteristics, using method (1):
Calling sequence: JMP RANDI
Entry conditions: rA = k, a positive integer < 5000.
Exit conditions: rA ← a random integer Y, $1 \le Y \le k$, with each integer
about equally probable; rX = ?; overflow off.


2. [15] Some people have been afraid that computers will someday take over the 
world; but they are reassured by the statement that a machine cannot do anything 
really new, since it is only obeying the commands of its master, the programmer. 
Lady Lovelace wrote in 1844, “The Analytical Engine has no pretensions to originate 
anything. It can do whatever we know how to order it to perform.” Her statement 
has been elaborated further by many philosophers. Discuss this topic, with random 
number generators in mind.

3. [32] (A dice game.) Write a program that simulates a roll of two dice, each of
which takes on the values 1, 2, ..., 6 with equal probability. If the total is 7 or 11 on 
the first roll, the game is won; a total of 2, 3, or 12 loses; and on any other total, call 
that total the “point” and continue rolling dice until either a 7 occurs (a loss) or the 
point occurs again (a win). 

Play ten games. The result of each roll of the dice should be printed in the form 
mn, where m and n are the contents of the two dice, followed by some appropriate 
comment (like “snake eyes” or “little Joe” or “the hard way”, etc.). 


4. [40] (Solitaire or patience.) Some people spend a lot of valuable time playing card
games of solitaire, and perhaps automation will make an important inroad in this area. 
Write a program that (i) shuffles a simulated deck of cards; (ii) plays some common 
game of solitaire based on the order of the cards in the deck; and (iii) displays the result 
of the game, showing how close the program came to winning. Cumulative statistics 
should be kept. The program might be set up to “cheat” upon request. 


5. [46] (Creative writing by computer.) A television program entitled “The Thinking 
Machine,” broadcast by the CBS television network on October 26, 1960, featured 
(among other things) two Western-style playlets that were written by a computer 
program. Here are the two scripts as they were printed out by the computer: 


Saga #1. (The gun is in the right hand; the money is in the left hand; the glass is 
on the table; the bottle is on the table; the holster is on the robber; the sheriff's gun 
is in the sheriff’s right hand; the sheriff's holster is on the sheriff.) 

ROBBER: (The robber is at the window.) Go to door; open door; go thru door; close 
door; go to corner; put money down at corner; go to table; put gun on 
table; sit and fidget; sit at table; pick up glass with right hand (empty); 
put glass on table; pick up bottle with right hand; pour; put bottle on 
table; pick up glass with right hand; take a drink from glass; put glass on 
table; pick up bottle with right hand; sit at table; sit at table; go to corner; 
go to window; go to table; put bottle on table; sit and fidget; sit at table; 
sit and fidget; go to window; go to table; pick up glass with right hand. 

SHERIFF: (The sheriff is at the window.) See robber; (robber sees sheriff); go to 
door. 

ROBBER: Put glass on table; pick up gun with right hand; check gun. 

SHERIFF: Wait; open door; see robber; (robber sees sheriff); go thru door. 

ROBBER: Go to window; aim; fire; SHERIFF NICKED. 

SHERIFF: Go to window; aim; fire; MISSED; go to door; go to window. 

ROBBER: Go to door; aim; aim. 

SHERIFF: Aim; fire; MISSED. 

ROBBER: Fire; SHERIFF NICKED. 

SHERIFF: Go to door; aim; fire; MISSED; go thru door; aim. 

ROBBER: Aim; fire; MISSED; aim; fire; MISSED. 

SHERIFF: Fire; MISSED; go to window; aim; fire; MISSED. 

ROBBER: Aim; fire; MISSED; aim; fire; MISSED; aim; fire; SHERIFF NICKED. 

SHERIFF: Aim; fire; ROBBER HIT. 

ROBBER: Drop gun; robber dies. 

SHERIFF: Put gun in holster; go to table; pick up glass with right hand (empty); take 
glass from right hand with left hand; pick up bottle with right hand; pour; 
put bottle on table; take glass from left hand with right hand; take a drink 
from glass; take glass from right hand with left hand; pick up bottle with 
right hand; pour; put bottle on table; take glass from left hand with right 
hand; take a drink from glass; put glass on table; go to corner; pick up 
money with right hand; go to door; go thru door; close door. CURTAIN.

Saga #2. (The gun is in the right hand; the money is in the left hand; the glass is
on the table; the bottle is on the table; the holster is on the robber; the sheriff’s gun
is in the sheriff’s right hand; the sheriff’s holster is on the sheriff.)

ROBBER: (The robber is at the window.) Go to door; open door; go thru door;
close door; go to corner; put money down at corner; go to window; put
gun down at window; lean on window and look; lean on window and look;
go to corner; count money; go to table; pick up glass with right hand
(empty); take glass from right hand with left hand; pick up bottle with
right hand; pour; put bottle on table; take glass from left hand with right
hand; take a drink from glass; put glass on table; pick up bottle with right
hand; pour; go to corner; put bottle down at corner; go to window; pick
up gun with right hand; check gun; put gun in holster; go to table; pick
up glass with right hand; take a drink from glass; go to window; put glass
down at window.

SHERIFF: (The sheriff is at the window.) See robber; (robber sees sheriff); go to
door.

ROBBER: Take gun from holster with right hand; check gun; go to door; check gun;
put gun down at door.

SHERIFF: Open door; see robber; (robber sees sheriff); go thru door; go to window.

ROBBER: Pick up gun with right hand.

SHERIFF: Go to table.

ROBBER: Aim; fire; MISSED; aim; fire; SHERIFF HIT; blow out barrel; put gun in
holster.

SHERIFF: Drop gun; sheriff dies.

ROBBER: Go to corner; pick up money with right hand; go to door; go thru door;
close door. CURTAIN.


A careful reading of these scripts reveals the highly intense drama present here. 
The computer program was careful to keep track of the locations of each player, the 
contents of his hands, etc. Actions taken by the players were random, governed by 
certain probabilities; the probability of a foolish action was increased depending on 
how much that player had had to drink and on how often he had been nicked by a 
shot. The reader will be able to deduce further properties of the program by studying 
the sample scripts.

Of course, even the best scripts are rewritten before they are produced, and this
is especially true when an inexperienced writer has prepared the original draft. Here 
are the scripts just as they were actually used in the show: 


Saga #1. Music up. 

MS Robber peering thru window of shack. 

CU Robber’s face. 

MS Robber entering shack. 

CU Robber sees whiskey bottle on table. 

CU Sheriff outside shack. 

MS Robber sees sheriff. 

LS Sheriff in doorway over shoulder of robber, both draw. 

MS Sheriff drawing gun. 

LS Shooting it out. Robber gets shot. 

MS Sheriff picking up money bags. 

MS Robber staggering. 

MS Robber dying. Falls across table, after trying to take last shot at sheriff. 
MS Sheriff walking thru doorway with money. 

MS of robber’s body, now still, lying across table top. Camera dollies back. (Laughter) 


Saga #2. Music up. 

CU of window. Robber appears. 

MS Robber entering shack with two sacks of money. 

MS Robber puts money bags on barrel. 

CU Robber—sees whiskey on table. 

MS Robber pouring himself a drink at table. Goes to count money. Laughs. 
MS Sheriff outside shack. 

MS thru window. 

MS Robber sees sheriff thru window. 

LS Sheriff entering shack. Draw. Shoot it out. 

CU Sheriff. Writhing from shot. 

M/2 shot Sheriff staggering to table for a drink . . . falls dead. 
MS Robber leaves shack with money bags.* 


[Note: CU = “close up”, MS = “medium shot”, etc. The details above were kindly 
furnished to the author by Thomas H. Wolf, producer of the television show, who sug- 
gested the idea of a computer-written playlet in the first place, and also by Douglas T. 
Ross and Harrison R. Morse, who produced the computer program.]


In the summer of 1952, Christopher Strachey had used the hardware random 
number generator of the Ferranti Mark I to compose the following letter: 


Honey Dear 
My sympathetic affection beautifully attracts your affectionate enthusi- 
asm. You are my loving adoration: my breathless adoration. My fellow 
feeling breathlessly hopes for your dear eagerness. My lovesick adoration 
cherishes your avid ardour. 
Yours wistfully, 
M. U. C. 


[Encounter 3 (1954), 4, 25-31; another example appears in the article on Electronic 
Computers in the 64th edition of Pears Cyclopedia (London, 1955), 190-191.] 


* © 1962 by Columbia Broadcasting System, Inc. All Rights Reserved. Used by permission. 
For further information, see J. E. Pfeiffer, The Thinking Machine (New York: J. B. Lippincott, 1962).

The reader will undoubtedly have many ideas about how to teach a computer to
do creative writing; and that is the point of this exercise. 


6. [40] Look at the subroutine library of each computer installation in your organi- 
zation, and replace the random number generators by good ones. Try to avoid being 
too shocked at what you find. 


7. [M40] A programmer decided to encipher his files by using a linear congruential
sequence $\langle X_n\rangle$ of period $2^{32}$ generated by (1) with $m = 2^{32}$. He took the most significant
bits $\lfloor X_n/2^{16}\rfloor$ and exclusive-or’ed them onto his data, but kept the parameters a, c,
and $X_0$ secret.

Show that this isn’t a very secure scheme, by devising a method that deduces the
multiplier a and the first difference $X_1 - X_0$ in a reasonable amount of time, given only
the values of $\lfloor X_n/2^{16}\rfloor$ for $0 \le n < 150$.


8. [M15] Suggest a good way to test whether an implementation of linear congruen- 
tial generators is working properly. 


9. [HM32] Let $X_0$, $X_1$, ... be the numbers produced by ran_array after ran_start
has initialized the generation process with seed s, and consider the polynomials

$$P_n(z) = X_{n+62}z^{99} + X_{n+61}z^{98} + \cdots + X_n z^{37} + X_{n+99}z^{36} + X_{n+98}z^{35} + \cdots + X_{n+64}z + X_{n+63}.$$

a) Prove that $P_n(z) \equiv z^{h(s)-n}$ (modulo 2 and $z^{100} + z^{37} + 1$), for some exponent h(s).

b) Express h(s) in terms of the binary representation of s.

c) Prove that if $X'_0$, $X'_1$, ... is the sequence of numbers produced by the same routines
from the seed $s' \ne s$, we have $X_{n+k} \equiv X'_{n'+k}$ (modulo 2) for $0 \le k < 100$ only if
$|n - n'| \ge 2^{70} - 1$.

10. [22] Convert the C code for ran_array and ran_start to FORTRAN 77 subroutines 
that generate exactly the same sequences of numbers. 


11. [M25] Assuming that floating point arithmetic on numbers of type double is 
properly rounded in the sense of Section 4.2.2 (hence exact when the values are suitably 
restricted), convert the C routines ran_array and ran_start to similar programs that 
deliver double-precision random fractions in the range [0..1), instead of 30-bit integers. 


12. [M21] What random number generator would be suitable for a minicomputer that 
does arithmetic only on integers in the range [−32768 .. 32767]?


13. [M25] Compare the subtract-with-borrow generators of exercise 3.2.1.1-12 to the 
lagged Fibonacci generators implemented in the programs of this section. 


14. [M35] (The future versus the past.) Let $X_n = (X_{n-37} + X_{n-100}) \bmod 2$ and
consider the sequence

$$\langle Y_0, Y_1, \ldots\rangle = \langle X_0, X_1, \ldots, X_{99}, X_{200}, X_{201}, \ldots, X_{299}, X_{400}, X_{401}, \ldots, X_{499}, X_{600}, \ldots\rangle.$$

(This sequence corresponds to calling ran_array(a, 200) repeatedly and looking only
at the least significant bits, after discarding half of the elements.) The following
experiment was repeated one million times using the sequence $\langle Y_n\rangle$: “Generate 100
random bits; then if 60 or more of them were 0, generate one more bit and print it.”
The result was to print 14527 0s and 13955 1s; but the probability that 28482 random
bits contain at most 13955 1s is only about .000358.

Give a mathematical explanation why so many 0s were output.


15. [25] Write C code that makes it convenient to generate the random integers 
obtained from ran_array by discarding all but the first 100 of every 1009 elements, 
as recommended in the text. 


CHAPTER FOUR 


ARITHMETIC 


Seeing there is nothing (right well beloued Students in the Mathematickes) 
that is so troublesome to Mathematicall practise, nor that doth more molest 
and hinder Calculators, then the Multiplications, Diuisions, square and 
cubical Extractions of great numbers, which besides the tedious 

expence of time, are for the most part subiect to many slippery errors. 

I began therefore to consider in my minde, by what certaine and 

ready Art I might remoue those hindrances.


— JOHN NEPAIR [NAPIER] (1616) 


I do hate sums. There is no greater mistake than to call arithmetic an exact
Science. There are. . . hidden laws of Number which it requires a mind 
like mine to perceive. For instance, if you add a sum from the bottom up, 
and then again from the top down, the result is always different. 


— M. P. LA TOUCHE (1878) 


I cannot conceive that anybody will require multiplications at the rate 
of 40,000, or even 4,000 per hour; such a revolutionary change as the 
octonary scale should not be imposed upon mankind in general 

for the sake of a few individuals. 


— F. H. WALES (1936) 


Most numerical analysts have no interest in arithmetic. 
— B. PARLETT (1979) 


THE CHIEF PURPOSE of this chapter is to make a careful study of the four 
basic processes of arithmetic: addition, subtraction, multiplication, and divi- 
sion. Many people regard arithmetic as a trivial thing that children learn and 
computers do, but we will see that arithmetic is a fascinating topic with many 
interesting facets. It is important to make a thorough study of efficient meth- 
ods for calculating with numbers, since arithmetic underlies so many computer 
applications. 

Arithmetic is, in fact, a lively subject that has played an important part in 
the history of the world, and it still is undergoing rapid development. In this 
chapter, we shall analyze algorithms for doing arithmetic operations on many 
types of quantities, such as “floating point” numbers, extremely large numbers, 
fractions (rational numbers), polynomials, and power series; and we will also 
discuss related topics such as radix conversion, factoring of numbers, and the 
evaluation of polynomials.

4.1. POSITIONAL NUMBER SYSTEMS
THE WAY WE DO ARITHMETIC is intimately related to the way we represent the 
numbers we deal with, so it is appropriate to begin our study of the subject with 
a discussion of the principal means for representing numbers. 

Positional notation using base b (or radix b) is defined by the rule 


$$(\ldots a_3 a_2 a_1 a_0 . a_{-1} a_{-2} \ldots)_b = \cdots + a_3 b^3 + a_2 b^2 + a_1 b^1 + a_0 + a_{-1} b^{-1} + a_{-2} b^{-2} + \cdots; \tag{1}$$

for example, $(520.3)_6 = 5\cdot 6^2 + 2\cdot 6^1 + 0 + 3\cdot 6^{-1} = 192\frac12$. Our conventional
decimal number system is, of course, the special case when b is ten, and when 
the a’s are chosen from the “decimal digits” 0, 1, 2, 3, 4, 5, 6, 7, 8, 9; in this 
case the subscript b in (1) may be omitted. 
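
Rule (1) can be turned into a few lines of code for a finite string of digits. The following is a minimal sketch, assuming the digits are supplied as two arrays (integer part and fraction part, each most significant first); the name positional_value and this calling convention are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: the value of
     (a_{n-1} ... a_1 a_0 . a_{-1} ... a_{-m})_b
   according to rule (1).  int_digits holds a_{n-1}, ..., a_0 and
   frac_digits holds a_{-1}, ..., a_{-m}.  The radix b may be any
   nonzero number, and the digits any integers. */
double positional_value(const int *int_digits, int n,
                        const int *frac_digits, int m, double b) {
  double value = 0.0, scale = 1.0;
  int k;
  assert(b != 0.0);
  for (k = 0; k < n; k++)            /* Horner's rule for the integer part */
    value = value * b + int_digits[k];
  for (k = 0; k < m; k++) {          /* a_{-1} b^{-1} + a_{-2} b^{-2} + ... */
    scale /= b;
    value += frac_digits[k] * scale;
  }
  return value;
}
```

With int_digits = {5, 2, 0}, frac_digits = {3}, and b = 6, the function evaluates the example (520.3)_6 from the text, up to floating point rounding.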

The simplest generalizations of the decimal number system are obtained 
when we take b to be an integer greater than 1 and when we require the a’s to 
be integers in the range $0 \le a_k < b$. This gives us the standard binary (b = 2),
ternary (b = 3), quaternary (b = 4), quinary (b = 5), ... number systems. In 
general, we could take b to be any nonzero number, and we could choose the a’s 
from any specified set of numbers; this leads to some interesting situations, as 
we shall see. 

The dot that appears between $a_0$ and $a_{-1}$ in (1) is called the radix point.
(When b = 10, it is also called the decimal point, and when b = 2, it is sometimes 
called the binary point, etc.) Continental Europeans often use a comma instead 
of a dot to denote the radix point; the English formerly used a raised dot. 

The a’s in (1) are called the digits of the representation. A digit $a_k$ for large k
is often said to be “more significant” than the digits $a_k$ for small k; accordingly,
the leftmost or “leading” digit is referred to as the most significant digit and the 
rightmost or “trailing” digit is referred to as the least significant digit. In the 
standard binary system the binary digits are often called bits; in the standard 
hexadecimal system (radix sixteen) the hexadecimal digits zero through fifteen 
are usually denoted by 

either 0, 1, 2, 3, 4,5, 6, 7, 8, 9, a, b, c, d, e, f 
or 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. 
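
As a small illustration of these conventions, the sketch below writes a nonnegative integer in any standard radix from 2 through 16, using the lowercase hexadecimal digits. The function name to_radix and its buffer convention are assumptions made for this example only.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: store the radix-b representation of n in out,
   most significant digit first, and return out.  Requires 2 <= b <= 16;
   out needs room for CHAR_BIT * sizeof(unsigned long) + 1 characters. */
char *to_radix(unsigned long n, unsigned b, char *out) {
  static const char digits[] = "0123456789abcdef";
  char tmp[CHAR_BIT * sizeof n];     /* enough even for radix 2 */
  int len = 0, i;
  assert(b >= 2 && b <= 16);
  do {                               /* least significant digit comes first */
    tmp[len++] = digits[n % b];
    n /= b;
  } while (n != 0);
  for (i = 0; i < len; i++)          /* reverse into reading order */
    out[i] = tmp[len - 1 - i];
  out[len] = '\0';
  return out;
}
```

Repeated division produces the trailing (least significant) digit first, which is why the digits are reversed before being returned.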


The historical development of number representations is a fascinating story, 
since it parallels the development of civilization itself. We would be going far 
afield if we were to examine this history in minute detail, but it will be instructive 
to look at its main features here. 

The earliest forms of number representations, still found in primitive cul- 
tures, are generally based on groups of fingers, piles of stones, etc., usually with 
special conventions about replacing a larger pile or group of, say, five or ten 
objects by one object of a special kind or in a special place. Such systems lead 
naturally to the earliest ways of representing numbers in written form, as in 
the systems of Babylonian, Egyptian, Greek, Chinese, and Roman numerals; 
but such notations are comparatively inconvenient for performing arithmetic 
operations except in the simplest cases.

During the twentieth century, historians of mathematics have made exten-
sive studies of early cuneiform tablets found by archæologists in the Middle 
East. These studies show that the Babylonian people actually had two distinct 
systems of number representation: The numbers used in everyday business 
transactions were written in a notation based on grouping by tens, hundreds, etc.; 
this notation was inherited from earlier Mesopotamian civilizations, and large 
numbers were seldom required. When more difficult mathematical problems 
were considered, however, Babylonian mathematicians made extensive use of 
a sexagesimal (radix sixty) positional notation that was highly developed at 
least as early as 1750 B.C. This notation was unique in that it was actually a 
floating point form of representation with exponents omitted; the proper scale 
factor or power of sixty was to be supplied by the context, so that, for example, 
the numbers 2, 120, 7200, and 1/30 were all written in an identical manner.
The notation was especially convenient for multiplication and division, using 
auxiliary tables, since radix-point alignment had no effect on the answer. As 
examples of this Babylonian notation, consider the following excerpts from early 
tables: The square of 30 is 15 (which may also be read, “The square of 1/2 is 1/4”);
the reciprocal of $81 = (1\ 21)_{60}$ is $(44\ 26\ 40)_{60}$; and the square of the latter is
$(32\ 55\ 18\ 31\ 6\ 40)_{60}$. The Babylonians had a symbol for zero, but because of
their “floating point” philosophy, it was used only within numbers, not at the 
right end to denote a scale factor. For the interesting story of early Babylonian 
mathematics, see O. Neugebauer, The Exact Sciences in Antiquity (Princeton, 
N. J.: Princeton University Press, 1952), and B. L. van der Waerden, Science 
Awakening, translated by A. Dresden (Groningen: P. Noordhoff, 1954); see also 
D. E. Knuth, CACM 15 (1972), 671-677; 19 (1976), 108. 
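
The scale-free character of this notation can be imitated for positive integers by discarding trailing zero digits in radix sixty. The sketch below is only an illustration; the name babylonian_digits is hypothetical, and fractions such as 1/30 are outside its scope (they would have to be scaled to integers first, as is done here for the reciprocal of 81 by computing 60^4/81 = 160000).

```c
#include <assert.h>

/* Hypothetical helper: store the sexagesimal digits of n in d[],
   most significant first, after dropping trailing zero digits (the
   omitted scale factor); return the number of digits stored.
   Requires n > 0; cap is the capacity of d[]. */
int babylonian_digits(unsigned long long n, int d[], int cap) {
  int tmp[16], len = 0, i;           /* a 64-bit number has at most 11 digits */
  assert(n > 0);
  while (n % 60 == 0) n /= 60;       /* the power of sixty is left to context */
  while (n > 0) {
    tmp[len++] = (int)(n % 60);
    n /= 60;
  }
  assert(len <= cap);
  for (i = 0; i < len; i++) d[i] = tmp[len - 1 - i];
  return len;
}
```

Under this convention 2, 120, and 7200 all yield the single digit 2, and 30 · 30 = 900 yields 15, in agreement with the examples above.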

Fixed point positional notation was apparently first conceived by the Maya 
Indians in central America some 2000 years ago; their radix-20 system was highly 
developed, especially in connection with astronomical records and calendar dates. 
They began to use a written sign for zero about A.D. 200. But the Spanish con- 
querors destroyed nearly all of the Maya books on history and science, so we have 
comparatively little knowledge about the degree of sophistication that native 
Americans had reached in arithmetic. Special-purpose multiplication tables have 
been found, but no examples of division are known. [See J. Eric S. Thompson, 
Contrib. to Amer. Anthropology and History 7 (Carnegie Inst. of Washington, 
1941), 37-67; J. Justeson, “Pratiche di calcolo nell’antica mesoamerica,” Storia 
della Scienza 2 (Rome: Istituto della Enciclopedia Italiana, 2001), 976—990.] 

Several centuries before Christ, the Greek people employed an early form 
of the abacus to do their arithmetical calculations, using sand and/or pebbles 
on a board that had rows or columns corresponding in a natural way to our 
decimal system. It is perhaps surprising to us that the same positional notation 
was never adapted to written forms of numbers, since we are so accustomed to 
decimal reckoning with pencil and paper; but the greater ease of calculating by 
abacus (since handwriting was not a common skill, and since abacus users need 
not memorize addition and multiplication tables) probably made the Greeks feel 
it would be silly even to suggest that computing could be done better on “scratch
paper.” At the same time Greek astronomers did make use of a sexagesimal
positional notation for fractions, which they had learned from the Babylonians. 

Our decimal notation, which differs from the more ancient forms primarily 
because of its fixed radix point, together with its symbol for zero to mark an 
empty position, was developed first in India within the Hindu culture. The exact 
date when this notation first appeared is quite uncertain; about A.D. 600 seems 
to be a good guess. Hindu science was highly developed at that time, particularly 
in astronomy. The earliest known Hindu manuscripts that show decimal notation 
have numbers written backwards (with the most significant digit at the right), 
but soon it became standard to put the most significant digit at the left. 

The Hindu principles of decimal arithmetic were brought to Persia about 
A.D. 750, as several important works were translated into Arabic; a picturesque 
account of this development is given in a Hebrew document by Abraham Ibn 
Ezra, which has been translated into English in AMM 25 (1918), 99-108. Not 
long after this, al-Khwārizmī wrote his Arabic textbook on the subject. (As 
noted in Chapter 1, our word “algorithm” comes from al-Khwārizmī’s name.)
His work was translated into Latin and was a strong influence on Leonardo 
Pisano (Fibonacci), whose book on arithmetic (A.D. 1202) played a major role 
in the spreading of Hindu-Arabic numerals into Europe. It is interesting to note 
that the left-to-right order of writing numbers was unchanged during these two 
transitions, although Arabic is written from right to left while Hindu and Latin 
scholars generally wrote from left to right. A detailed account of the subsequent 
propagation of decimal numeration and arithmetic into all parts of Europe during 
the period 1200-1600 has been given by David Eugene Smith in his History of 
Mathematics 1 (Boston: Ginn and Co., 1923), Chapters 6 and 8. 

Decimal notation was applied at first only to integer numbers, not to frac- 
tions. Arabic astronomers, who required fractions in their star charts and other 
tables, continued to use the notation of Ptolemy (the famous Greek astronomer), 
a notation based on sexagesimal fractions. This system still survives today in 
our trigonometric units of degrees, minutes, and seconds, and also in our units 
of time, as a remnant of the original Babylonian sexagesimal notation. Early 
European mathematicians also used sexagesimal fractions when dealing with 
noninteger numbers; for example, Fibonacci gave the value 


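$$1^\circ\ 22'\ 7''\ 42'''\ 33^{\rm IV}\ 4^{\rm V}\ 40^{\rm VI}$$

as an approximation to the root of the equation $x^3 + 2x^2 + 10x = 20$. (The
correct answer is $1^\circ\ 22'\ 7''\ 42'''\ 33^{\rm IV}\ 4^{\rm V}\ 38^{\rm VI}\ 30^{\rm VII}\ 50^{\rm VIII}\ 15^{\rm IX}\ 43^{\rm X} \ldots$.)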
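
The quality of Fibonacci's approximation can be checked numerically. The following is a minimal sketch, assuming ordinary double-precision arithmetic; the function names are hypothetical, and Newton's method is our choice for locating the root, not something attributed to Fibonacci.

```c
#include <assert.h>

/* f(x) = x^3 + 2x^2 + 10x - 20, written in Horner form. */
double cubic(double x) { return ((x + 2.0) * x + 10.0) * x - 20.0; }

/* Hypothetical helper: the value of the sexagesimal number
   d[0] + d[1]/60 + d[2]/60^2 + ... + d[n-1]/60^(n-1),
   accumulated from the least significant digit upward. */
double sexagesimal_value(const int d[], int n) {
  double v = 0.0;
  int k;
  for (k = n - 1; k >= 0; k--) v = v / 60.0 + d[k];
  return v;
}

/* The real root of f, by Newton's method started at x = 1.
   Since f'(x) = 3x^2 + 4x + 10 is always positive, f has exactly
   one real root; 20 iterations are far more than are needed. */
double cubic_root(void) {
  double x = 1.0;
  int i;
  for (i = 0; i < 20; i++)
    x -= cubic(x) / ((3.0 * x + 4.0) * x + 10.0);
  return x;
}
```

Passing the digits {1, 22, 7, 42, 33, 4, 40} to sexagesimal_value and comparing with cubic_root() shows agreement to roughly ten decimal places; the discrepancy comes from the last sexagesimal digit, 40 where about 38.5 is correct.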

The use of decimal notation also for tenths, hundredths, etc., in a similar 
way seems to be a comparatively minor change; but, of course, it is hard to 
break with tradition, and sexagesimal fractions have an advantage over decimal 
fractions because numbers such as 1/3 can be expressed exactly, in a simple way.

Chinese mathematicians — who never used sexagesimals — were apparently 
the first people to work with the equivalent of decimal fractions, although their 
numeral system (lacking zero) was not originally a positional number system in
the strict sense. Chinese units of weights and measures were decimal, so that
Tsu Ch’ung-Chih (who died in A.D. 500 or 501) was able to express an approximation
to $\pi$ in the following form:

3 chang, 1 ch’ih, 4 ts’un, 1 fen, 5 li, 9 hao, 2 miao, 7 hu.
Here chang, ..., hu are units of length; 1 hu (the diameter of a silk thread) equals 
1/10 miao, etc. The use of such decimal-like fractions was fairly widespread in 
China after about 1250. 

An embryonic form of truly positional decimal fractions appeared in a 10th- 
century arithmetic text, written in Damascus by an obscure mathematician 
named al-Uqlīdisī (“the Euclidean”). He occasionally marked the place of a 
decimal point, for example in connection with a problem about compound interest,
the computation of 135 times $(1.1)^n$ for $1 \le n \le 5$. [See A. S. Saidan,
The Arithmetic of al-Uqlīdisī (Dordrecht: D. Reidel, 1975), 110, 114, 343, 355, 
481-485.] But he did not develop the idea very fully, and his trick was soon 
forgotten. Al-Samaw’al of Baghdad and Baku, writing in 1172, understood that 
$\sqrt{10} = 3.162277\ldots$, but he had no convenient way to write such approximations
down. Several centuries passed before decimal fractions were reinvented by a Per- 
sian mathematician, al-Kashi, who died in 1429. Al-Kashi was a highly skillful 
calculator, who gave the value of $2\pi$ as follows, correct to 16 decimal places:

    integer | fractions
     0   6  | 2  8  3  1  8  5  3  0  7  1  7  9  5  8  6  5

This was by far the best approximation to $\pi$ known until Ludolph van Ceulen
laboriously calculated 35 decimal places during the period 1586-1610. 

Decimal fractions began to appear sporadically in Europe; for example, a 
so-called “Turkish method” was used to compute 153.5 x 16.25 = 2494.375. 
Giovanni Bianchini developed them further, with applications to surveying, prior 
to 1450; but like al-Uqlīdisī, his work seems to have had little influence. Christof
Rudolff and François Viète suggested the idea again in 1525 and 1579. Finally,
an arithmetic text by Simon Stevin, who independently hit on the idea of decimal 
fractions in 1585, became popular. Stevin’s work, and the discovery of logarithms 
soon afterwards, made decimal fractions commonplace in Europe during the 
17th century. [For further remarks and references, see D. E. Smith, History of 
Mathematics 2 (1925), 228-247; V. J. Katz, A History of Mathematics (1993), 
225-228, 345-348; and G. Rosińska, Quart. J. Hist. Sci. Tech. 40 (1995), 17-32.] 

The binary system of notation has its own interesting history. Many prim- 
itive tribes in existence today are known to use a binary or “pair” system of 
counting (making groups of two instead of five or ten), but they do not count in 
a true radix-2 system, since they do not treat powers of 2 in a special manner. 
See The Diffusion of Counting Practices by Abraham Seidenberg, Univ. of Calif. 
Publ. in Math. 3 (1960), 215-300, for interesting details about primitive number 
systems. Another “primitive” example of an essentially binary system is the 
conventional musical notation for expressing rhythms and durations of time. 

Nondecimal number systems were discussed in Europe during the seven- 
teenth century. For many years astronomers had occasionally used sexagesimal 
arithmetic both for the integer and the fractional parts of numbers, primarily
when performing multiplication [see John Wallis, Treatise of Algebra (Oxford: 
1685), 18-22, 30]. The fact that any integer greater than 1 could serve as radix 
was apparently first stated in print by Blaise Pascal in De Numeris Multiplicibus, 
which was written about 1658 [see Pascal’s Œuvres Complètes (Paris: Éditions
du Seuil, 1963), 84-89]. Pascal wrote, “Denaria enim ex instituto hominum,
non ex necessitate naturæ ut vulgus arbitratur, et sane satis inepte, posita est”;
i.e., “The decimal system has been established, somewhat foolishly to be sure, 
according to man’s custom, not from a natural necessity as most people think.” 
He stated that the duodecimal (radix twelve) system would be a welcome change, 
and he gave a rule for testing a duodecimal number for divisibility by nine. 
Erhard Weigel tried to drum up enthusiasm for the quaternary (radix four) 
system in a series of publications beginning in 1673. A detailed discussion of 
radix-twelve arithmetic was given by Joshua Jordaine, Duodecimal Arithmetick 
(London: 1687). 

Although decimal notation was almost exclusively used for arithmetic during 
that era, other systems of weights and measures were rarely if ever based on 
multiples of 10, and business transactions required a good deal of skill in adding 
quantities such as pounds, shillings, and pence. For centuries merchants had 
therefore learned to compute sums and differences of quantities expressed in pe- 
culiar units of currency, weights, and measures; thus they were doing arithmetic 
in nondecimal number systems. The common units of liquid measure in England, 
dating from the 13th century or earlier, are particularly noteworthy: 


2 gills = 1 chopin 
2 chopins = 1 pint 
2 pints = 1 quart 
2 quarts = 1 pottle 
2 pottles = 1 gallon 
2 gallons = 1 peck 
2 pecks = 1 demibushel
2 demibushels = 1 bushel or firkin
2 firkins = 1 kilderkin 
2 kilderkins = 1 barrel 
2 barrels = 1 hogshead 
2 hogsheads = 1 pipe 
2 pipes = 1 tun 


Quantities of liquid expressed in gallons, pottles, quarts, pints, etc. were essen- 
tially written in binary notation. Perhaps the true inventors of binary arithmetic 
were British wine merchants! 

The first known appearance of pure binary notation was about 1605 in some 
unpublished manuscripts of Thomas Harriot (1560-1621). Harriot was a creative 
man who first became famous by coming to America as a representative of Sir 
Walter Raleigh. He invented (among other things) a notation like that now used 
for “less than” and “greater than” relations; but for some reason he chose not 
to publish many of his discoveries. Excerpts from his notes on binary arithmetic 
have been reproduced by John W. Shirley, Amer. J. Physics 19 (1951), 452-454; 
Harriot’s discovery of binary notation was first cited by Frank Morley in The 
Scientific Monthly 14 (1922), 60-66. 

The first published treatment of the binary system appeared in the work of 
a prominent Cistercian bishop, Juan Caramuel de Lobkowitz, Mathesis Biceps 1 
(Campaniæ: 1670), 45-48. Caramuel discussed the representation of numbers in
radices 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, and 60 at some length, but gave no examples 
of arithmetic operations in nondecimal systems except in the sexagesimal case. 

Ultimately, an article by G. W. Leibniz [Mémoires de l’Académie Royale des 
Sciences (Paris, 1703), 110-116], which illustrated binary addition, subtraction, 
multiplication, and division, really brought binary notation into the limelight, 
and his article is usually referred to as the birth of radix-2 arithmetic. Leibniz 
later referred to the binary system quite frequently. He did not recommend it for 
practical calculations, but he stressed its importance in number-theoretical inves- 
tigations, since patterns in number sequences are often more apparent in binary 
notation than they are in decimal; he also saw a mystical significance in the fact 
that everything is expressible in terms of zero and one. Leibniz’s unpublished 
manuscripts show that he had been interested in binary notation as early as 
1679, when he referred to it as a “bimal” system (analogous to “decimal”). 

A careful study of Leibniz’s early work with binary numbers has been made 
by Hans J. Zacher, Die Hauptschriften zur Dyadik von G. W. Leibniz (Frankfurt 
am Main: Klostermann, 1973). Zacher points out that Leibniz was familiar with 
John Napier’s so-called “local arithmetic,” a way for calculating with stones 
that amounts to using a radix-2 abacus. [Napier had published the idea of local 
arithmetic as part three of his little book Rabdologiæ in 1617; it may be called the 
world’s first “binary computer,” and it is surely the world’s cheapest, although 
Napier felt that it was more amusing than practical. See Martin Gardner’s 
discussion in Knotted Doughnuts and Other Mathematical Entertainments (New 
York: Freeman, 1986), Chapter 8.] 

It is interesting to note that the important concept of negative powers to the 
right of the radix point was not yet well understood at that time. Leibniz asked 
James Bernoulli to calculate $\pi$ in the binary system, and Bernoulli “solved” the
problem by taking a 35-digit approximation to $\pi$, multiplying it by $10^{35}$, and
then expressing this integer in the binary system as his answer. On a smaller
scale this would be like saying that $\pi \approx 3.14$, and $(314)_{10} = (100111010)_2$;
hence $\pi$ in binary is 100111010! [See Leibniz, Math. Schriften, edited by C. I.
Gerhardt, 3 (Halle: 1855), 97; two of the 118 bits in the answer are incorrect, due
to computational errors.] The motive for Bernoulli’s calculation was apparently
to see whether any simple pattern could be observed in this representation of $\pi$.
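
The difference between Bernoulli's answer and a true radix-2 expansion is easy to see on the smaller example. The Python sketch below is illustrative only; the helper name `to_binary` is an assumption of this example. It expands a nonnegative rational number with negative powers of 2 to the right of the radix point, by repeated doubling of the fractional part.

```python
from fractions import Fraction

def to_binary(x, frac_bits):
    """Binary expansion of a nonnegative Fraction, truncated to frac_bits places."""
    whole = x.numerator // x.denominator
    frac = x - whole
    bits = []
    for _ in range(frac_bits):
        frac *= 2                  # the next bit is the integer part after doubling
        bit = int(frac >= 1)
        bits.append(str(bit))
        frac -= bit
    return f"{whole:b}." + "".join(bits)

print(format(314, "b"))                     # Bernoulli's style: the integer 314 in binary
print(to_binary(Fraction(314, 100), 12))    # 3.14 itself, with a radix point
```

The first line reproduces the bit string 100111010 mentioned above; the second begins 11.0010..., which is what an expansion of a number near 3.14 ought to look like.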

Charles XII of Sweden, whose talent for mathematics perhaps exceeded that 
of all other kings in the history of the world, hit on the idea of radix-8 arithmetic 
about 1717. This was probably his own invention, although he had met Leibniz 
briefly in 1707. Charles felt that radix 8 or 64 would be more convenient 
for calculation than the decimal system, and he considered introducing octal 
arithmetic into Sweden; but he died in battle before decreeing such a change. 
[See The Works of Voltaire 21 (Paris: E. R. DuMont, 1901), 49; E. Swedenborg, 
Gentleman’s Magazine 24 (1754), 423-424.] 

Octal notation was proposed also in colonial America before 1750, by the 
Rev. Hugh Jones, professor at the College of William and Mary [see Gentleman’s 
Magazine 15 (1745), 377-379; H. R. Phalen, AMM 56 (1949), 461-465]. 
More than a century later, a prominent Swedish-American civil engineer
named John W. Nystrom decided to carry Charles XII’s plans a step further, 
by devising a complete system of numeration, weights, and measures based on 
radix-16 arithmetic. He wrote, “I am not afraid, or do not hesitate, to advocate a 
binary system of arithmetic and metrology. I know I have nature on my side; if I 
do not succeed to impress upon you its utility and great importance to mankind, 
it will reflect that much less credit upon our generation, upon our scientific men 
and philosophers.” Nystrom devised special means for pronouncing hexadecimal 
numbers; for example, $(\mathrm{C0160})_{16}$ was to be read “vybong, bysanton.” His entire
system was called the Tonal System, and it is described in J. Franklin Inst. 46 
(1863), 263-275, 337-348, 402-407. A similar system, but using radix 8, was 
worked out by Alfred B. Taylor [Proc. Amer. Pharmaceutical Assoc. 8 (1859), 
115-216; Proc. Amer. Philosophical Soc. 24 (1887), 296-366]. Increased use of 
the French (metric) system of weights and measures prompted extensive debate 
about the merits of decimal arithmetic during that era; indeed, octal arithmetic 
was even being proposed in France [J. D. Collenne, Le Système Octaval (Paris: 
1845); Aimé Mariage, Numération par Huit (Paris: Le Nonnant, 1857)]. 

The binary system was well known as a curiosity ever since Leibniz’s time, 
and about 20 early references to it have been compiled by R. C. Archibald 
[AMM 25 (1918), 139-142]. It was applied chiefly to the calculation of powers, 
as explained in Section 4.6.3, and to the analysis of certain games and puzzles. 
Giuseppe Peano [Atti della R. Accademia delle Scienze di Torino 34 (1898), 47- 
55] used binary notation as the basis of a “logical” character set of 256 symbols. 
Joseph Bowden [Special Topics in Theoretical Arithmetic (Garden City: 1936), 
49] gave his own system of nomenclature for hexadecimal numbers. 

The book History of Binary and Other Nondecimal Numeration by Anton 
Glaser (Los Angeles: Tomash, 1981) contains an informative and nearly complete 
discussion of the development of binary notation, including English translations 
of many of the works cited above [see Historia Math. 10 (1983), 236-243]. 

Much of the recent history of number systems is connected with the develop- 
ment of calculating machines. Charles Babbage’s notebooks for 1838 show that 
he considered using nondecimal numbers in his Analytical Engine [see M. V. 
Wilkes, Historia Math. 4 (1977), 421]. Increased interest in mechanical devices 
for arithmetic, especially for multiplication, led several people in the 1930s to 
consider the binary system for this purpose. A particularly delightful account of 
such activity appears in the article “Binary Calculation” by E. William Phillips 
[Journal of the Institute of Actuaries 67 (1936), 187-221] together with a record 
of the discussion that followed a lecture he gave on the subject. Phillips began by 
saying, “The ultimate aim [of this paper] is to persuade the whole civilized world 
to abandon decimal numeration and to use octonal [that is, radix 8] numeration 
in its place.” 

Modern readers of Phillips’s article will perhaps be surprised to discover that 
a radix-8 number system was properly referred to as “octonary” or “octonal,” 
according to all dictionaries of the English language at that time, just as the 
radix-10 number system is properly called either “denary” or “decimal”; the 
word “octal” did not appear in English language dictionaries until 1961, and it
apparently originated as a term for the base of a certain class of vacuum tubes. 
The word “hexadecimal,” which has crept into our language even more recently, 
is a mixture of Greek and Latin stems; more proper terms would be “senidenary” 
or “sedecimal” or even “sexadecimal,” but the latter is perhaps too risqué for 
computer programmers. 

The comment by Mr. Wales that is quoted at the beginning of this chapter 
has been taken from the discussion printed with Phillips’s paper. Another man 
who attended the same lecture objected to the octal system for business purposes: 
“5% becomes 3.1463 per 64, which sounds rather horrible.” 

Phillips got the inspiration for his proposals from an electronic circuit that 
was capable of counting in binary [C. E. Wynn-Williams, Proc. Roy. Soc. London 
A136 (1932), 312-324]. Electromechanical and electronic circuitry for general 
arithmetic operations was developed during the late 1930s, notably by John V. 
Atanasoff and George R. Stibitz in the U.S.A., L. Couffignal and R. Valtat in 
France, Helmut Schreyer and Konrad Zuse in Germany. All of these inventors 
used the binary system, although Stibitz later developed excess-3 binary-coded- 
decimal notation. A fascinating account of these early developments, including 
reprints and translations of important contemporary documents, appears in 
Brian Randell’s book The Origins of Digital Computers (Berlin: Springer, 1973). 

The first American high-speed computers, built in the early 1940s, used 
decimal arithmetic. But in 1946, an important memorandum by A. W. Burks, 
H. H. Goldstine, and J. von Neumann, in connection with the design of the first 
stored-program computers, gave detailed reasons for making a radical departure 
from tradition and using base-two notation [see John von Neumann, Collected 
Works 5, 41-65]. Since then binary computers have multiplied. After a dozen 
years of experience with binary machines, a discussion of the relative advantages 
and disadvantages of radix-2 notation was given by W. Buchholz in his paper 
“Fingers or Fists?” [CACM 2,12 (December 1959), 3-11]. 

The MIX computer used in this book has been defined so that it can be 
either binary or decimal. It is interesting to note that nearly all MIX programs 
can be expressed without knowing whether binary or decimal notation is being 
used— even when we are doing calculations involving multiple-precision arith- 
metic. Thus we find that the choice of radix does not significantly influence 
computer programming. (Noteworthy exceptions to this statement, however, are 
the “Boolean” algorithms discussed in Section 7.1; see also Algorithm 4.5.2B.) 


There are several different ways to represent negative numbers in a computer, 
and this sometimes influences the way arithmetic is done. In order to understand 
these notations, let us first consider MIX as if it were a decimal computer; then 
each word contains 10 digits and a sign, for example 


−12345 67890. (2)


This is called the signed magnitude representation. Such a representation agrees 
with common notational conventions, so it is preferred by many programmers. A 
potential disadvantage is that minus zero and plus zero can both be represented, 
while they usually should mean the same number; this possibility requires some
care in practice, although it turns out to be useful at times. 

Most mechanical calculators that do decimal arithmetic use another system 
called ten’s complement notation. If we subtract 1 from 00000 00000, we get 
99999 99999 in this notation; in other words, no explicit sign is attached to the 
number, and calculation is done modulo $10^{10}$. The number −12345 67890 would
appear as 

87654 32110 (3) 


in ten’s complement notation. It is conventional to regard any number whose 
leading digit is 5, 6, 7, 8, or 9 as a negative value in this notation, although 
with respect to addition and subtraction there is no harm in regarding (3) as 
the number +87654 32110 if it is convenient to do so. Notice that there is no 
problem of minus zero in such a system. 
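
A minimal Python sketch of ten's complement words follows, assuming 10-digit words as in the decimal MIX example; the names `tens_encode`, `tens_decode`, and `tens_add` are hypothetical and introduced only for illustration. Encoding is reduction modulo $10^{10}$, and a word is read as negative when its leading digit is 5 or more.

```python
P = 10                    # digits per word (assumed, matching the example above)
MOD = 10 ** P

def tens_encode(x):
    """Ten's complement word for the integer x (no explicit sign)."""
    return x % MOD

def tens_decode(w):
    """Conventional reading: leading digit 5..9 means a negative value."""
    return w - MOD if w >= MOD // 2 else w

def tens_add(u, v):
    """Addition is ordinary addition modulo 10^P; the carry off the left end is dropped."""
    return (u + v) % MOD

print(f"{tens_encode(-1234567890):010d}")   # the word shown in (3)
print(tens_decode(tens_add(tens_encode(5), tens_encode(-7))))
```

Because zero has the single encoding 00000 00000, no separate case for minus zero is needed.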

The major difference between signed magnitude and ten’s complement
notations in practice is that shifting right does not divide the magnitude by ten;
for example, the number $-11 = \ldots99989$, shifted right one, gives $\ldots99998 = -2$
(assuming that a shift to the right inserts “9” as the leading digit when the
number shifted is negative). In general, $x$ shifted right one digit in ten’s complement
notation will give $\lfloor x/10 \rfloor$, whether $x$ is positive or negative.

A possible disadvantage of the ten’s complement system is the fact that 
it is not symmetric about zero; the p-digit negative number 500...0 is not the 
negative of any p-digit positive number. Thus it is possible that changing x to −x
will cause overflow. (See exercises 7 and 31 for a discussion of radix-complement 
notation with infinite precision.) 
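
Both observations about ten's complement words (the floor behavior of a right shift, and the asymmetry about zero) can be tried out in a few lines. This is a sketch under the assumption of 10-digit words; `shift_right`, `negate`, and `decode` are names made up for the example.

```python
P = 10
MOD = 10 ** P
HALF = MOD // 2            # words >= HALF are read as negative

def decode(w):
    return w - MOD if w >= HALF else w

def shift_right(w):
    """Shift one digit right, inserting 9 on the left when the word is negative."""
    lead = 9 if w >= HALF else 0
    return lead * 10 ** (P - 1) + w // 10

def negate(w):
    """Ten's complement negation, modulo 10^P."""
    return (-w) % MOD

print(decode(shift_right((-11) % MOD)))   # floor(-11/10)
print(negate(HALF) == HALF)               # 500...0 has no positive counterpart
```

Negating the word 50000 00000 returns the same word, which is the overflow case described above.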

Another notation that has been used since the earliest days of high-speed 
computers is called nines’ complement representation. In this case the number 
−12345 67890 would appear as


87654 32109. (4) 


Each digit of a negative number $(-x)$ is equal to 9 minus the corresponding digit
of $x$. It is not difficult to see that the nines’ complement notation for a negative
number is always one less than the corresponding ten’s complement notation.
Addition and subtraction are done modulo $10^{10} - 1$, which means that a carry
off the left end is to be added at the right end. (See the discussion of arithmetic
modulo $w - 1$ in Section 3.2.1.1.) Again there is a potential problem with minus
zero, since 99999 99999 and 00000 00000 denote the same value. 
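
The end-around carry can be sketched in Python as follows, again assuming 10-digit words; the function names are hypothetical. A carry off the left end is removed and added back at the right end, which is the same as reducing modulo $10^{10} - 1$ except that the all-nines word (minus zero) is left as it is.

```python
P = 10
M = 10 ** P - 1            # the all-nines word 99999 99999

def nines_encode(x):
    """Nines' complement word: each digit of -x is 9 minus the digit of x."""
    return x if x >= 0 else M + x

def nines_decode(w):
    return w - M if w >= 5 * 10 ** (P - 1) else w

def nines_add(u, v):
    s = u + v
    if s > M:              # a carry left the high end ...
        s = s - (M + 1) + 1    # ... so drop it and add 1 at the right end
    return s

print(f"{nines_encode(-1234567890):010d}")  # the word shown in (4)
print(nines_decode(nines_add(nines_encode(-11), nines_encode(25))))
print(nines_add(nines_encode(5), nines_encode(-5)))   # minus zero
```

The last line yields 99999 99999 rather than 00000 00000, illustrating why programs on such machines must treat the two words as equal.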

The ideas just explained for radix-10 arithmetic apply in a similar way to 
radix-2 arithmetic, where we have signed magnitude, two’s complement, and 
ones’ complement notations. Two’s complement arithmetic on n-bit numbers 
is arithmetic modulo $2^n$; ones’ complement arithmetic is modulo $2^n - 1$. The
MIX computer, as used in the examples of this chapter, deals only with signed 
magnitude arithmetic; however, alternative procedures for complement notations 
are discussed in the accompanying text when it is important to do so. 

Detail-oriented readers and copy editors should notice the position of the 
apostrophe in terms like “two’s complement” and “ones’ complement”: A two’s 
complement number is complemented with respect to a single power of 2, while
a ones’ complement number is complemented with respect to a long sequence
of 1s. Indeed, there is also a “twos’ complement notation,” which has radix 3
and complementation with respect to $(2\ldots22)_3$.

Descriptions of machine language often tell us that a computer’s circuitry 
is set up with the radix point at a particular place within each numeric word. 
Such statements should usually be disregarded. It is better to learn the rules 
concerning where the radix point will appear in the result of an instruction if 
we assume that it lies in a certain place beforehand. For example, in the case 
of MIX we could regard our operands either as integers with the radix point at 
the extreme right, or as fractions with the radix point at the extreme left, or as 
some mixture of these two extremes; the rules for the appearance of the radix 
point after addition, subtraction, multiplication, or division are straightforward. 


It is easy to see that there is a simple relation between radix $b$ and radix $b^k$:

$$(\ldots a_3 a_2 a_1 a_0 . a_{-1} a_{-2} \ldots)_b = (\ldots A_3 A_2 A_1 A_0 . A_{-1} A_{-2} \ldots)_{b^k}, \qquad (5)$$

where

$$A_j = (a_{kj+k-1} \ldots a_{kj+1} a_{kj})_b;$$

see exercise 8. Thus we have simple techniques for converting at sight between,
say, binary and hexadecimal notation. 
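
Relation (5) amounts to grouping digits $k$ at a time. The Python sketch below handles the integer part only (a fractional part would be padded on the right instead of the left); the function name `regroup` is an assumption of this illustration.

```python
def regroup(digits, b, k):
    """Convert radix-b digits (most significant first) to radix b**k
    by collecting them in groups of k, as in relation (5)."""
    pad = (-len(digits)) % k                 # leading zeros to fill the top group
    digits = [0] * pad + list(digits)
    return [
        sum(d * b ** (k - 1 - i) for i, d in enumerate(digits[j:j + k]))
        for j in range(0, len(digits), k)
    ]

bits = [1, 0, 0, 1, 1, 1, 0, 1, 0]           # (100111010)_2 = 314
print(regroup(bits, 2, 4))                    # hexadecimal digits
print(regroup(bits, 2, 3))                    # octal digits
```

For the bit string above the groups are 0001 0011 1010 and 100 111 010, so the number is (13A) in hexadecimal and (472) in octal.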

Many interesting variations on positional number systems are possible in 
addition to the standard b-ary systems discussed so far. For example, we might 
have numbers in base $-10$, so that

$$\begin{aligned}
(\ldots a_3 a_2 a_1 a_0 . a_{-1} a_{-2} \ldots)_{-10}
&= \cdots + a_3(-10)^3 + a_2(-10)^2 + a_1(-10)^1 + a_0 + \cdots \\
&= \cdots - 1000a_3 + 100a_2 - 10a_1 + a_0 - \tfrac{1}{10}a_{-1} + \tfrac{1}{100}a_{-2} - \cdots.
\end{aligned}$$

Here the individual digits satisfy $0 \le a_k \le 9$ just as in the decimal system. The
number 12345 67890 appears in the “negadecimal” system as


$(1\ 93755\ 73910)_{-10}, \qquad (6)$


since the latter represents 10305070900 − 9070503010. It is interesting to note
that the negative of this number, −12345 67890, would be written


$(28466\ 48290)_{-10}, \qquad (7)$


and, in fact, every real number whether positive or negative can be represented
without a sign in the $-10$ system.
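
Conversion of an integer to a negative radix can be done by repeated division, provided each remainder is forced into the range $0 \le r < |b|$. The following Python sketch is one way to do this; the names `to_negabase` and `from_negabase` are hypothetical, and the adjustment step relies on Python's convention that `divmod` with a negative divisor returns a nonpositive remainder.

```python
def to_negabase(n, base=-10):
    """Digits (most significant first) of the integer n in a negative radix."""
    if n == 0:
        return [0]
    digits = []
    while n != 0:
        n, r = divmod(n, base)     # Python gives base < r <= 0 here
        if r < 0:                  # move the remainder into 0 .. |base|-1
            r -= base
            n += 1
        digits.append(r)
    return digits[::-1]

def from_negabase(digits, base=-10):
    value = 0
    for d in digits:               # Horner's rule
        value = value * base + d
    return value

print("".join(map(str, to_negabase(1234567890))))
print("".join(map(str, to_negabase(-1234567890))))
```

The two lines printed are the digit strings of (6) and (7); no sign is needed in either case.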

Negative-base systems were first considered by Vittorio Grünwald [Giornale 
di Matematiche di Battaglini 23 (1885), 203-221, 367], who explained how to 
perform the four arithmetic operations in such systems; Grünwald also discussed 
root extraction, divisibility tests, and radix conversion. However, his work seems 
to have had no effect on other research, since it was published in a rather 
obscure journal, and it was soon forgotten. The next publication about negative- 
base systems was apparently by A. J. Kempner [AMM 43 (1936), 610-617], 
who discussed the properties of noninteger radices and remarked in a footnote
that negative radices would be feasible too. After twenty more years the idea 
was rediscovered again, this time by Z. Pawlak and A. Wakulicz [Bulletin de 
l’Académie Polonaise des Sciences, Classe III, 5 (1957), 233-236; Série des 
sciences techniques 7 (1959), 713-721], and also by L. Wadel [IRE Transactions 
EC-6 (1957), 123]. Experimental computers called SKRZAT 1 and BINEG, which 
used $-2$ as the radix of arithmetic, were built in Poland in the late 1950s;
see N. M. Blachman, CACM 4 (1961), 257; R. W. Marczyński, Ann. Hist. 
Computing 2 (1980), 37-48. For further references see IEEE Transactions EC- 
12 (1963), 274-277; Computer Design 6 (May 1967), 52-63. There is evidence 
that the idea of negative bases occurred independently to quite a few people. For 
example, D. E. Knuth had discussed negative-radix systems in 1955, together 
with a further generalization to complex-valued bases, in a short paper submitted 
to a “science talent search” contest for high-school seniors. 

The base 2i gives rise to a system called the “quater-imaginary” number 
system (by analogy with “quaternary”), which has the unusual feature that every 
complex number can be represented with the digits 0, 1, 2, and 3 without a sign. 
[See D. E. Knuth, CACM 3 (1960), 245-247; 4 (1961), 355.] For example, 


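$$(11210.31)_{2i} = 1\cdot16 + 1\cdot(-8i) + 2\cdot(-4) + 1\cdot(2i) + 3\cdot(-\tfrac12 i) + 1\cdot(-\tfrac14) = 7\tfrac34 - 7\tfrac12 i.$$

Here the number $(a_{2n} \ldots a_1 a_0 . a_{-1} \ldots a_{-2k})_{2i}$ is equal to

$$(a_{2n} \ldots a_2 a_0 . a_{-2} \ldots a_{-2k})_{-4} + 2i\,(a_{2n-1} \ldots a_3 a_1 . a_{-1} \ldots a_{-2k+1})_{-4};$$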

so conversion to and from quater-imaginary notation reduces to conversion to and 
from negative quaternary representation of the real and imaginary parts. The 
interesting property of this system is that it allows multiplication and division 
of complex numbers to be done in a fairly unified manner without treating real 
and imaginary parts separately. For example, we can multiply two numbers in 
this system much as we do with any base, merely using a different carry rule: 
Whenever a digit exceeds 3 we subtract 4 and carry $-1$ two columns to the left;
when a digit is negative, we add 4 to it and carry $+1$ two columns to the left.
The following example shows this peculiar carry rule at work: 


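$$\begin{array}{rl}
1\,2\,2\,3\,1 & [9 - 10i] \\
\times\ 1\,2\,2\,3\,1 & [9 - 10i] \\ \hline
1\,2\,2\,3\,1 \\
1\,0\,3\,2\,0\,2\,1\,3\phantom{\,0} \\
1\,3\,0\,2\,2\phantom{\,0\,0} \\
1\,3\,0\,2\,2\phantom{\,0\,0\,0} \\
1\,2\,2\,3\,1\phantom{\,0\,0\,0\,0} \\ \hline
0\,2\,1\,3\,3\,3\,1\,2\,1 & [-19 - 180i]
\end{array}$$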
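
The reduction to negative quaternary notation can be sketched in Python for Gaussian integers. The function names below are assumptions of this example. The real part supplies the even-numbered digits, half the imaginary part supplies the odd-numbered digits, and an odd imaginary part is handled by the single fractional digit $a_{-1} = 2$, since $2\cdot(2i)^{-1} = -i$.

```python
from fractions import Fraction

def to_negabase(n, base):
    """Digits of n in a negative radix, least significant first."""
    digits = []
    while n != 0:
        n, r = divmod(n, base)
        if r < 0:
            r -= base
            n += 1
        digits.append(r)
    return digits or [0]

def to_quater_imaginary(re, im):
    """Quater-imaginary numeral for the Gaussian integer re + im*i."""
    frac = ""
    if im % 2:                         # odd imaginary part: borrow via a_{-1} = 2
        im += 1
        frac = ".2"
    even = to_negabase(re, -4)         # a_0, a_2, a_4, ...
    odd = to_negabase(im // 2, -4)     # a_1, a_3, a_5, ...
    digits = [0] * max(2 * len(even) - 1, 2 * len(odd))
    for j, d in enumerate(even):
        digits[2 * j] = d
    for j, d in enumerate(odd):
        digits[2 * j + 1] = d
    text = "".join(map(str, reversed(digits))).lstrip("0") or "0"
    return text + frac

def from_quater_imaginary(s):
    """Exact (real, imaginary) value of a quater-imaginary numeral."""
    whole, _, frac = s.partition(".")
    re = im = Fraction(0)
    pr, pi = Fraction(1), Fraction(0)          # current power of 2i
    for ch in reversed(whole):
        re += int(ch) * pr
        im += int(ch) * pi
        pr, pi = -2 * pi, 2 * pr               # multiply the power by 2i
    pr, pi = Fraction(0), Fraction(-1, 2)      # (2i)^(-1) = -i/2
    for ch in frac:
        re += int(ch) * pr
        im += int(ch) * pi
        pr, pi = pi / 2, -pr / 2               # divide the power by 2i
    return re, im

print(to_quater_imaginary(9, -10), to_quater_imaginary(-19, -180))
print(from_quater_imaginary("11210.31"))
```

The operands and the product of the multiplication example above are reproduced by `to_quater_imaginary`, and `from_quater_imaginary` recovers $7\frac34 - 7\frac12 i$ from the numeral 11210.31.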


A similar system that uses just the digits 0 and 1 may be based on $\sqrt2\,i$,
but this requires an infinite nonrepeating expansion for the simple number “i”
itself. Vittorio Grünwald proposed using the digits 0 and $1/\sqrt2$ in odd-numbered
positions, to avoid such a problem; but that actually spoils the whole system [see 
Commentari dell’Ateneo di Brescia (1886), 43-54]. 
[Figure 1 omitted: a region of the complex plane, with the points $-1+i$, $+i$, $+1+i$, $+1$, $-1-i$, $-i$, and $+1-i$ labeled.]
Fig. 1. The fractal set $S$ called the “twindragon.”


Another “binary” complex number system may be obtained by using the 
base i — 1, as suggested by W. Penney [JACM 12 (1965), 247-248]: 


$$\begin{aligned}
(\ldots a_4 a_3 a_2 a_1 a_0 . a_{-1} \ldots)_{i-1}
= \cdots - 4a_4 + (2i+2)a_3 - 2i\,a_2 + (i-1)a_1 + a_0 - \tfrac12(i+1)a_{-1} + \cdots.
\end{aligned}$$


In this system, only the digits 0 and 1 are needed. One way to demonstrate that 
every complex number has such a representation is to consider the interesting 
set S shown in Fig. 1; this set is, by definition, all points that can be written as 
$\sum_{k\ge1} a_k (i-1)^{-k}$, for an infinite sequence $a_1$, $a_2$, $a_3$, ... of zeros and ones. It is
also known as the “twindragon fractal” [see M. F. Barnsley, Fractals Everywhere,
second edition (Academic Press, 1993), 306, 310]. Figure 1 shows that $S$ can be
decomposed into 256 pieces congruent to $\frac{1}{16}S$. Notice that if the diagram of $S$
is rotated counterclockwise by 135°, we obtain two adjacent sets congruent to
$(1/\sqrt2)\,S$, because $(i-1)S = S \cup (S+1)$. For details of a proof that $S$ contains
all complex numbers that are of sufficiently small magnitude, see exercise 18.
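
For Gaussian integers, the digits in radix $i - 1$ can be found by repeated exact division. The Python sketch below is an illustration with hypothetical function names. It uses the fact that $i - 1$ divides $x + yi$ exactly when $x + y$ is even, so the low-order digit is $(x + y) \bmod 2$.

```python
def to_base_i_minus_1(re, im):
    """Binary digits (most significant first) of re + im*i in radix i - 1."""
    if re == 0 and im == 0:
        return "0"
    bits = []
    while re != 0 or im != 0:
        bit = (re + im) % 2            # parity of x + y decides the digit
        re -= bit
        # (x + yi) / (i - 1) = ((y - x) + (-x - y) i) / 2, exact here
        re, im = (im - re) // 2, (-re - im) // 2
        bits.append(str(bit))
    return "".join(reversed(bits))

def from_base_i_minus_1(bits):
    re, im = 0, 0
    for b in bits:
        # (x + yi)(i - 1) = (-x - y) + (x - y) i, then add the next digit
        re, im = -re - im + int(b), re - im
    return re, im

for z in [(2, 0), (-1, 0), (0, 1)]:
    print(z, to_base_i_minus_1(*z))
```

For instance, 2 comes out as 1100, in agreement with the expansion above: $(i-1)^3 + (i-1)^2 = (2i+2) - 2i = 2$.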

Perhaps the prettiest number system of all is the balanced ternary notation, 
which consists of radix-3 representation using $-1$, 0, and $+1$ as “trits” (ternary
digits) instead of 0, 1, and 2. If we let the symbol $\bar1$ stand for $-1$, we have the
following examples of balanced ternary numbers:

    Balanced ternary                  Decimal
    $1\,0\,\bar1$                     $8$
    $1\,1\,\bar1\,0.\bar1\,\bar1$     $32\frac59$
    $\bar1\,\bar1\,1\,0.1\,1$         $-32\frac59$
    $\bar1\,\bar1\,1\,0$              $-33$
    $0.11111\ldots$                   $\frac12$


One way to find the representation of a number in the balanced ternary 
system is to start by representing it in ordinary ternary notation; for example, 


$$208.3 = (21201.022002200220\ldots)_3.$$


(A very simple pencil-and-paper method for converting to ternary notation is 
given in exercise 4.4-12.) Now add the infinite number ...11111.11111... in 
ternary notation; we obtain, in the example above, the infinite number 


$$(\ldots11111210012.210121012101\ldots)_3.$$

Finally, subtract ...11111.11111... by decrementing each digit; we get

$$208.3 = (1\,0\,\bar1\,\bar1\,0\,1.1\,0\,\bar1\,0\,1\,0\,\bar1\,0\,1\,0\,\bar1\,0\ldots)_3. \qquad (8)$$


This process may clearly be made rigorous if we replace the artificial infinite 
number ...11111.11111... by a number with suitably many ones. 
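
For integers, the add-then-decrement procedure can be written out directly. In the Python sketch below (the name `to_balanced_ternary` is an assumption of this example), the “number with suitably many ones” is $(3^m - 1)/2$, where $m$ is chosen just large enough that the sum is nonnegative and fits in $m$ ternary digits.

```python
def to_balanced_ternary(n):
    """Trits of the integer n, most significant first, each in {-1, 0, 1}."""
    m = 1
    while (3 ** m - 1) // 2 < abs(n):      # need |n| <= (11...1)_3 with m ones
        m += 1
    shifted = n + (3 ** m - 1) // 2        # add 11...1 in ordinary ternary
    trits = []
    for _ in range(m):
        shifted, d = divmod(shifted, 3)
        trits.append(d - 1)                # subtract 11...1 digit by digit
    return trits[::-1]

print(to_balanced_ternary(208))
print(to_balanced_ternary(-33))
```

For 208 the intermediate ternary number is 210012, and decrementing each digit gives the integer part of (8).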
The balanced ternary number system has many pleasant properties: 


a) The negative of a number is obtained by interchanging 1 and $\bar1$.


b) The sign of a number is given by its most significant nonzero trit, and in 
general we can compare any two numbers by reading them from left to right 
and using lexicographic order, as in the decimal system. 


c) The operation of rounding to the nearest integer is identical to truncation; 
in other words, we simply delete everything to the right of the radix point. 
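
Properties (a), (b), and (c) can be observed on the example $32\frac59$ from the table of examples. The Python sketch below represents a number by two lists of trits (integer part and fraction part); the names `bt_value`, `bt_negate`, and `bt_sign` are hypothetical.

```python
from fractions import Fraction

def bt_value(int_trits, frac_trits=()):
    """Exact value of a balanced ternary numeral."""
    v = Fraction(0)
    for t in int_trits:
        v = 3 * v + t
    scale = Fraction(1, 3)
    for t in frac_trits:
        v += t * scale
        scale /= 3
    return v

def bt_negate(trits):
    """Property (a): interchange 1 and -1."""
    return [-t for t in trits]

def bt_sign(trits):
    """Property (b): the sign is the most significant nonzero trit."""
    return next((t for t in trits if t), 0)

whole, frac = [1, 1, -1, 0], [-1, -1]          # 32 5/9
print(bt_value(whole, frac), bt_value(bt_negate(whole), bt_negate(frac)))
print(bt_value(whole))                          # property (c): truncation rounds
```

Truncation works as rounding because the digits to the right of the radix point can contribute at most $\frac12$ in absolute value.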


Addition in the balanced ternary system is quite simple, using the table 


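$$\begin{array}{*{27}{c}}
\bar1&\bar1&\bar1&\bar1&\bar1&\bar1&\bar1&\bar1&\bar1&0&0&0&0&0&0&0&0&0&1&1&1&1&1&1&1&1&1 \\
\bar1&\bar1&\bar1&0&0&0&1&1&1&\bar1&\bar1&\bar1&0&0&0&1&1&1&\bar1&\bar1&\bar1&0&0&0&1&1&1 \\
\bar1&0&1&\bar1&0&1&\bar1&0&1&\bar1&0&1&\bar1&0&1&\bar1&0&1&\bar1&0&1&\bar1&0&1&\bar1&0&1 \\ \hline
\bar10&\bar11&\bar1&\bar11&\bar1&0&\bar1&0&1&\bar11&\bar1&0&\bar1&0&1&0&1&1\bar1&\bar1&0&1&0&1&1\bar1&1&1\bar1&10
\end{array}$$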


(The three inputs to the addition are the digits of the numbers to be added and 
the carry digit.) Subtraction is negation followed by addition. Multiplication 
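also reduces to negation and addition, as in the following example:

$$\begin{array}{rl}
1\,\bar1\,0\,\bar1 & [17] \\
\times\ 1\,\bar1\,0\,\bar1 & [17] \\ \hline
\bar1\,1\,0\,1 \\
\bar1\,1\,0\,1\phantom{\,0\,0} \\
1\,\bar1\,0\,\bar1\phantom{\,0\,0\,0} \\ \hline
0\,1\,1\,\bar1\,\bar1\,0\,1 & [289]
\end{array}$$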
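
The addition rule and the reduction of multiplication to negation and addition might be sketched as follows in Python. The names `bt_add` and `bt_multiply` are assumptions of this example, and the sum digit is computed arithmetically rather than looked up in the 27-entry table, although the results are the same.

```python
def bt_add(a, b):
    """Add two balanced ternary integers (lists of trits, most significant first)."""
    n = max(len(a), len(b))
    a = [0] * (n - len(a)) + list(a)
    b = [0] * (n - len(b)) + list(b)
    out, carry = [], 0
    for x, y in zip(reversed(a), reversed(b)):
        s = x + y + carry              # three inputs, total between -3 and 3
        digit = (s + 1) % 3 - 1        # balanced residue in {-1, 0, 1}
        carry = (s - digit) // 3
        out.append(digit)
    out.append(carry)
    while len(out) > 1 and out[-1] == 0:   # drop leading zeros
        out.pop()
    return out[::-1]

def bt_multiply(a, b):
    """Multiply by shifting, then adding a or its negative for each multiplier trit."""
    result = [0]
    for t in b:
        result = result + [0]          # shift left one place (times 3)
        if t == 1:
            result = bt_add(result, a)
        elif t == -1:
            result = bt_add(result, [-x for x in a])
    return bt_add(result, [0])         # normalize leading zeros

seventeen = [1, -1, 0, -1]
print(bt_add(seventeen, seventeen))
print(bt_multiply(seventeen, seventeen))
```

The product agrees with the worked example, apart from the leading zero that the pencil-and-paper layout retains.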


Representation of numbers in the balanced ternary system is implicitly 
present in a famous mathematical puzzle, commonly called “Bachet’s problem 
of weights” — although it was already stated by Fibonacci four centuries before 
Bachet wrote his book, and by Tabari in Persia more than 100 years before Fi- 
bonacci. [See W. Ahrens, Mathematische Unterhaltungen und Spiele 1 (Leipzig: 
Teubner, 1910), Section 3.4; H. Hermelink, Janus 65 (1978), 105-117.] Positional 
number systems with negative digits were invented by J. Colson [Philos. Trans. 
34 (1726), 161-173], then forgotten and rediscovered about 100 years later by Sir 
John Leslie [The Philosophy of Arithmetic (Edinburgh: 1817); see pages 33-34, 
54, 64-65, 117, 150], and by A. Cauchy [Comptes Rendus Acad. Sci. 11 (Paris, 
1840), 789-798]. Cauchy pointed out that negative digits make it unnecessary for 
a person to memorize the multiplication table past $5 \times 5$. A claim that such
number systems were known in India long ago [J. Bharati, Vedic Mathematics (Delhi:
Motilal Banarsidass, 1965)] has been refuted by K. S. Shukla [Mathematical
Education 5,3 (1989), 129-133]. The first true appearance of “pure” balanced 
ternary notation was in an article by Léon Lalanne [Comptes Rendus Acad. 
Sci. 11 (Paris, 1840), 903-905], who was a designer of mechanical devices for 
arithmetic. Thomas Fowler independently invented and constructed a balanced 
ternary calculator at about the same time [see Report British Assoc. Adv. Sci. 
10 (1840), 55; 11 (1841), 39-40]. The balanced ternary number system was men- 
tioned only rarely for the next 100 years, until the development of the first elec- 
tronic computers at the Moore School of Electrical Engineering in 1945-1946; at 
that time it was given serious consideration as a possible replacement for the dec- 
imal system. The complexity of arithmetic circuitry for balanced ternary arith- 
metic is not much greater than it is for the binary system, and a given number 
requires only $\ln 2/\ln 3 \approx 63\%$ as many digit positions for its representation.
Discussions of the balanced ternary system appear in AMM 57 (1950), 90-93, and
in High-speed Computing Devices, Engineering Research Associates (McGraw-Hill,
1950), 287-289. The experimental Russian computer SETUN was based on
balanced ternary notation [see CACM 3 (1960), 149-150], and perhaps the sym- 
metric properties and simple arithmetic of this number system will prove to be 
quite important someday — when the “flip-flop” is replaced by a “flip-flap-flop.” 

Positional notation generalizes in another important way to a mixed-radix
system. Given a sequence of numbers $\langle b_n \rangle$ (where $n$ may be negative), we define

$$\begin{bmatrix} \ldots, & a_3, & a_2, & a_1, & a_0; & a_{-1}, & a_{-2}, & \ldots \\ \ldots, & b_3, & b_2, & b_1, & b_0; & b_{-1}, & b_{-2}, & \ldots \end{bmatrix}
= \cdots + a_3 b_2 b_1 b_0 + a_2 b_1 b_0 + a_1 b_0 + a_0 + a_{-1}/b_{-1} + a_{-2}/b_{-1}b_{-2} + \cdots. \qquad (9)$$


In the simplest mixed-radix systems, we work only with integers; we let $b_0$, $b_1$,
$b_2$, ... be integers greater than one, and deal only with numbers that have no
radix point, where $a_n$ is required to lie in the range $0 \le a_n < b_n$.

One of the most important mixed-radix systems is the factorial number 
system, where $b_n = n + 2$. Using this system, which was known in 13th-century
India, we can represent every positive integer uniquely in the form

$$c_n n! + c_{n-1}(n-1)! + \cdots + c_2 2! + c_1, \qquad (10)$$

where $0 \le c_k \le k$ for $1 \le k \le n$, and $c_n \ne 0$. (See Algorithms 3.3.2P and 3.4.2P.)
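
The coefficients in (10) can be obtained by dividing successively by 2, 3, 4, ..., since $b_n = n + 2$. The Python sketch below is illustrative; the function names are hypothetical.

```python
from math import factorial

def to_factorial_base(n):
    """Return [c_n, ..., c_2, c_1] with n = sum of c_k * k! and 0 <= c_k <= k."""
    digits, radix = [], 2
    while n > 0:
        n, c = divmod(n, radix)        # c_1 uses radix 2, c_2 radix 3, ...
        digits.append(c)
        radix += 1
    return digits[::-1]

def from_factorial_base(digits):
    n = len(digits)
    return sum(c * factorial(n - j) for j, c in enumerate(digits))

print(to_factorial_base(1000))
```

For example, $1000 = 1\cdot6! + 2\cdot5! + 1\cdot4! + 2\cdot3! + 2\cdot2! + 0\cdot1!$.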

Mixed-radix systems are familiar in everyday life, when we deal with units 
of measure. For example, the quantity “3 weeks, 2 days, 9 hours, 22 minutes, 57 
seconds, and 492 milliseconds” is equal to 


$$\begin{bmatrix} 3, & 2, & 9, & 22, & 57; & 492 \\ & 7, & 24, & 60, & 60; & 1000 \end{bmatrix} \text{ seconds.}$$
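
Definition (9) can be evaluated by Horner's rule on the integer part and by a running product of radices on the fraction part. The following Python sketch uses a hypothetical function `mixed_radix_value`; its argument layout (the leading digit has no radix beneath it, as in the display above) is an assumption of this example.

```python
from fractions import Fraction

def mixed_radix_value(int_digits, int_radices, frac_digits=(), frac_radices=()):
    """Evaluate (9). int_digits is a_n, ..., a_0 and int_radices is
    b_{n-1}, ..., b_0; frac_digits is a_{-1}, ... and frac_radices is b_{-1}, ..."""
    value = Fraction(int_digits[0])
    for a, b in zip(int_digits[1:], int_radices):
        value = value * b + a          # Horner's rule with a varying radix
    scale = Fraction(1)
    for a, b in zip(frac_digits, frac_radices):
        scale /= b
        value += a * scale
    return value

# 3 weeks, 2 days, 9 hours, 22 minutes, 57 seconds, 492 milliseconds
seconds = mixed_radix_value([3, 2, 9, 22, 57], [7, 24, 60, 60], [492], [1000])
print(float(seconds))
```

The quantity of time in the example comes to 2020977.492 seconds.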


The quantity “10 pounds, 6 shillings, and thruppence ha’penny” was once equal
to $\begin{bmatrix} 10, & 6, & 3; & 1 \\ & 20, & 12; & 2 \end{bmatrix}$ pence in British currency, before Great Britain changed to a
purely decimal monetary system.

It is possible to add and subtract mixed-radix numbers by using a straight-
forward generalization of the usual addition and subtraction algorithms, provided
of course that the same mixed-radix system is being used for both operands
(see exercise 4.3.1-9). Similarly, we can easily multiply or divide a mixed-radix
number by small integer constants, using simple extensions of the familiar pencil-
and-paper methods.

Mixed-radix systems were first discussed in full generality by Georg Cantor
[Zeitschrift für Math. und Physik 14 (1869), 121-128]. Exercises 26 and 29 give
further information about them.


Several questions concerning irrational radices have been investigated by 
W. Parry, Acta Math. Acad. Sci. Hung. 11 (1960), 401-416. 

Besides the systems described in this section, several other ways to represent 
numbers are mentioned elsewhere in this series of books: the combinatorial num- 
ber system (exercise 1.2.6-56); the Fibonacci number system (exercises 1.2.8-34, 
5.4.2-10); the phi number system (exercise 1.2.8-35); modular representations 
(Section 4.3.2); Gray code (Section 7.2.1); and Roman numerals (Section 9.1). 


EXERCISES 
1. [15] Express —10, —9,..., 9, 10 in the number system whose radix is —2. 


2. [24] Consider the following four number systems: (a) binary (signed magnitude);
(b) negabinary (radix −2); (c) balanced ternary; and (d) radix $b = \frac{1}{10}$. Use each of
these four number systems to express each of the following three numbers: (i) −49;
(ii) $-3\frac{1}{7}$ (show the repeating cycle); (iii) $\pi$ (to a few significant figures).

3. [20] Express −49 + i in the quater-imaginary system.

4. [15] Assume that we have a MIX program in which location A contains a number
for which the radix point lies between bytes 3 and 4, while location B contains a number
whose radix point lies between bytes 2 and 3. (The leftmost byte is number 1.) Where
will the radix point be, in registers A and X, after the following instructions?


(a) LDA A; MUL B (b) LDA A; SRAX 5; DIV B 


5. [00] Explain why a negative integer in nines’ complement notation has a represen- 
tation in ten’s complement notation that is always one greater, if the representations 
are regarded as positive. 


6. [16] What are the largest and smallest p-bit integers that can be represented 
in (a) signed magnitude binary notation (including one bit for the sign), (b) two’s 
complement notation, (c) ones’ complement notation? 


7. [M20] The text defines ten’s complement notation only for integers represented 
in a single computer word. Is there a way to define a ten’s complement notation for all 
real numbers, having “infinite precision,” analogous to the text’s definition? Is there a 
similar way to define a nines’ complement notation for all real numbers? 


8. [M10] Prove Eq. (5). 


9. [15] Change the following octal numbers to hexadecimal notation, using the hexa- 
decimal digits 0, 1,..., 9, A, B, C, D, E, F: 12; 5655; 2550276; 76545336; 3726755. 


10. [M22] Generalize Eq. (5) to mixed-radix notation as in (9). 


11. [22] Design an algorithm that uses the −2 number system to compute the sum
of $(a_n \ldots a_1 a_0)_{-2}$ and $(b_n \ldots b_1 b_0)_{-2}$, obtaining the answer $(c_{n+2} \ldots c_1 c_0)_{-2}$.

12. [23] Specify algorithms that convert (a) the binary signed magnitude number
$\pm(a_n \ldots a_0)_2$ to its negabinary form $(b_{n+2} \ldots b_0)_{-2}$; and (b) the negabinary number
$(b_{n+1} \ldots b_0)_{-2}$ to its signed magnitude form $\pm(a_{n+1} \ldots a_0)_2$.


13. [M21] In the decimal system there are some numbers with two infinite decimal 
expansions; for example, 2.3599999... = 2.3600000.... Does the negadecimal (base 
—10) system have unique expansions, or are there real numbers with two different 
infinite expansions in this base also? 

14. [14] Multiply $(11321)_{2i}$ by itself in the quater-imaginary system using the method
illustrated in the text.

15. [M24] What are the sets $S = \{\sum_{k \ge 1} a_k b^{-k} \mid a_k \text{ an allowable digit}\}$, analogous
to Fig. 1, for the negative decimal and for the quater-imaginary number systems?

16. [M24] Design an algorithm to add 1 to $(a_n \ldots a_1 a_0)_{i-1}$ in the $i - 1$ number system.


17. [M30] It may seem peculiar that i — 1 has been suggested as a number-system 
base, instead of the similar but intuitively simpler number i + 1. Can every complex 
number a+bi, where a and b are integers, be represented in a positional number system 
to base i+ 1, using only the digits 0 and 1? 


18. [HM32] Show that the twindragon of Fig. 1 is a closed set that contains a neighbor- 
hood of the origin. (Consequently, every complex number has a binary representation 
with radix i — 1.) 

19. [23] (David W. Matula.) Let D be a set of b integers, containing exactly one
solution to the congruence $x \equiv j$ (modulo b) for $0 \le j < b$. Prove that all integers m
(positive, negative, or zero) can be represented in the form $m = (a_n \ldots a_0)_b$, where all
the $a_j$ are in D, if and only if all integers in the range $l \le m \le u$ can be so represented,
where $l = -\max\{a \mid a \in D\}/(b-1)$ and $u = -\min\{a \mid a \in D\}/(b-1)$. For example,
$D = \{-1, 0, \ldots, b-2\}$ satisfies the conditions for all $b \ge 3$. [Hint: Design an algorithm
that constructs a suitable representation.]


20. [HM28] (David W. Matula.) Consider a decimal number system that uses the 
digits D = {−1, 0, 8, 17, 26, 35, 44, 53, 62, 71} instead of {0, 1, ..., 9}. The result of
exercise 19 implies (as in exercise 18) that all real numbers have an infinite decimal 
expansion using digits from D. 

In the usual decimal system, exercise 13 points out that some numbers have two 
representations. (a) Find a real number that has more than two D-decimal represen- 
tations. (b) Show that no real number has infinitely many D-decimal representations. 
(c) Show that uncountably many numbers have two or more D-decimal representations. 


21. [M22] (C. E. Shannon.) Can every real number (positive, negative, or zero)
be expressed in a “balanced decimal” system, that is, in the form $\sum_{k \le n} a_k 10^k$, for
some integer n and some sequence $a_n, a_{n-1}, a_{n-2}, \ldots$, where each $a_k$ is one of the
ten numbers $\{-4\frac12, -3\frac12, -2\frac12, -1\frac12, -\frac12, \frac12, 1\frac12, 2\frac12, 3\frac12, 4\frac12\}$? (Although zero is not one
of the allowed digits, we implicitly assume that $a_{n+1}$, $a_{n+2}$, ... are zero.) Find all
representations of zero in this number system, and find all representations of unity.

22. [HM25] Let $\alpha = -\sum_{m \ge 1} 10^{-m^2}$. Given $\epsilon > 0$ and any real number x, prove that
there is a “decimal” representation such that $0 < |x - \sum_{k=0}^{n} a_k 10^k| < \epsilon$, where each $a_k$
is allowed to be only one of the three values 0, 1, or $\alpha$. (No negative powers of 10 are
used in this representation!)

23. [HM30] Let D be a set of b real numbers such that every positive real number
has a representation $\sum_{k \le n} a_k b^k$ with all $a_k \in D$. Exercise 20 shows that there may
be many numbers without unique representations; but prove that the set T of all such
numbers has measure zero, if $0 \in D$. Show that this conclusion need not be true if
$0 \notin D$.

24. [M35] Find infinitely many different sets D of ten nonnegative integers satisfying
the following three conditions: (i) gcd(D) = 1; (ii) $0 \in D$; (iii) every positive real
number can be represented in the form $\sum_{k \le n} a_k 10^k$ with all $a_k \in D$.

25. [M25] (S. A. Cook.) Let b, u, and v be positive integers, where $b \ge 2$ and
$0 < v < b^m$. Show that the radix-b representation of u/v does not contain a run of
m consecutive digits equal to b − 1, anywhere to the right of the radix point. (By
convention, no runs of infinitely many (b − 1)’s are permitted in the standard radix-b
representation.)


26. [HM30] (N. S. Mendelsohn.) Let $\langle \beta_n \rangle$ be a sequence of real numbers defined for
all integers n, $-\infty < n < \infty$, such that

$$\beta_n < \beta_{n+1}; \qquad \lim_{n \to \infty} \beta_n = \infty; \qquad \lim_{n \to -\infty} \beta_n = 0.$$

Let $\langle c_n \rangle$ be an arbitrary sequence of positive integers that is defined for all integers n,
$-\infty < n < \infty$. Let us say that a number x has a “generalized representation” if
there is an integer n and an infinite sequence of integers $a_n, a_{n-1}, a_{n-2}, \ldots$ such that
$x = \sum_{k \le n} a_k \beta_k$, where $a_n \ne 0$, $0 \le a_k \le c_k$, and $a_k < c_k$ for infinitely many k.
Show that every positive real number x has exactly one generalized representation
if and only if

$$\beta_{n+1} = \sum_{k \le n} c_k \beta_k \qquad \text{for all } n.$$

(Consequently, the mixed-radix systems with integer bases all have this property; and
mixed-radix systems with $\beta_1 = (c_0+1)\beta_0$, $\beta_2 = (c_1+1)(c_0+1)\beta_0$, ..., $\beta_{-1} = \beta_0/(c_{-1}+1)$, ...
are the most general number systems of this type.)

27. [M21] Show that every nonzero integer has a unique “reversing binary representation”

$$2^{e_0} - 2^{e_1} + \cdots + (-1)^t\, 2^{e_t},$$

where $e_0 < e_1 < \cdots < e_t$.

28. [M24] Show that every nonzero complex number of the form a + bi where a and b
are integers has a unique “revolving binary representation”

$$(1+i)^{e_0} + i(1+i)^{e_1} - (1+i)^{e_2} - i(1+i)^{e_3} + \cdots + i^t (1+i)^{e_t},$$

where $e_0 < e_1 < \cdots < e_t$. (Compare with exercise 27.)


29. [M35] (N. G. de Bruijn.) Let $S_0$, $S_1$, $S_2$, ... be sets of nonnegative integers;
we will say that the collection $\{S_0, S_1, S_2, \ldots\}$ has Property B if every nonnegative
integer n can be written in the form

$$n = s_0 + s_1 + s_2 + \cdots, \qquad s_j \in S_j,$$

in exactly one way. (Property B implies that $0 \in S_j$ for all j, since n = 0 can only
be represented as 0 + 0 + 0 + ···.) Any mixed-radix number system with radices $b_0$,
$b_1$, $b_2$, ... provides an example of a collection of sets satisfying Property B, if we let
$S_j = \{0, B_j, \ldots, (b_j - 1)B_j\}$, where $B_j = b_0 b_1 \ldots b_{j-1}$; here the representation of
$n = s_0 + s_1 + s_2 + \cdots$ corresponds in an obvious manner to its mixed-radix representation (9).
Furthermore, if the collection $\{S_0, S_1, S_2, \ldots\}$ has Property B, and if $A_0$, $A_1$, $A_2$, ...
is any partition of the nonnegative integers (so that we have $A_0 \cup A_1 \cup A_2 \cup \cdots = \{0, 1, 2, \ldots\}$
and $A_i \cap A_j = \emptyset$ for $i \ne j$; some $A_j$’s may be empty), then the “collapsed”
collection $\{T_0, T_1, T_2, \ldots\}$ also has Property B, where $T_j$ is the set of all sums $\sum_{i \in A_j} s_i$
taken over all possible choices of $s_i \in S_i$.

Prove that any collection $\{T_0, T_1, T_2, \ldots\}$ that satisfies Property B may be obtained
by collapsing some collection $\{S_0, S_1, S_2, \ldots\}$ that corresponds to a mixed-radix number
system.

30. [M39] (N. G. de Bruijn.) The negabinary number system shows us that every
integer (positive, negative, or zero) has a unique representation of the form

$$(-2)^{e_1} + (-2)^{e_2} + \cdots + (-2)^{e_t}, \qquad e_1 > e_2 > \cdots > e_t \ge 0, \quad t \ge 0.$$

The purpose of this exercise is to explore generalizations of this phenomenon.

a) Let $b_0$, $b_1$, $b_2$, ... be a sequence of integers such that every integer n has a unique
representation of the form

$$n = b_{e_1} + b_{e_2} + \cdots + b_{e_t}, \qquad e_1 > e_2 > \cdots > e_t \ge 0, \quad t \ge 0.$$

(Such a sequence $\langle b_n \rangle$ is called a “binary basis.”) Show that there is an index j
such that $b_j$ is odd, but $b_k$ is even for all $k \ne j$.

b) Prove that a binary basis $\langle b_n \rangle$ can always be rearranged into the form $d_0$, $2d_1$,
$4d_2$, ... = $\langle 2^n d_n \rangle$, where each $d_k$ is odd.

c) If each of $d_0$, $d_1$, $d_2$, ... in (b) is ±1, prove that $\langle b_n \rangle$ is a binary basis if and only
if there are infinitely many +1’s and infinitely many −1’s.

d) Prove that $7,\ -13 \cdot 2,\ 7 \cdot 2^2,\ -13 \cdot 2^3,\ \ldots,\ 7 \cdot 2^{2k},\ -13 \cdot 2^{2k+1},\ \ldots$ is a binary basis,
and find the representation of n = 1.

▶ 31. [M35] A generalization of two’s complement arithmetic, called “2-adic numbers,”
was introduced by K. Hensel in Crelle 127 (1904), 51-84. (In fact he treated p-adic
numbers, for any prime p.) A 2-adic number may be regarded as a binary number

$$u = (\ldots u_3 u_2 u_1 u_0 . u_{-1} \ldots u_{-n})_2,$$

whose representation extends infinitely far to the left of the binary point, but only
finitely many places to the right. Addition, subtraction, and multiplication of 2-adic
numbers are done according to the ordinary procedures of arithmetic, which can in
principle be extended indefinitely to the left. For example,

$$\begin{aligned}
7 &= (\ldots000000000000111)_2 & \tfrac{1}{7} &= (\ldots110110110110111)_2 \\
-7 &= (\ldots111111111111001)_2 & -\tfrac{1}{7} &= (\ldots001001001001001)_2 \\
\tfrac{7}{4} &= (\ldots000000000000001.11)_2 & \tfrac{1}{10} &= (\ldots110011001100110.1)_2 \\
\sqrt{-7} &= (\ldots100000010110101)_2 \ \text{or}\ (\ldots011111101001011)_2.
\end{aligned}$$

Here 7 appears as the ordinary binary integer seven, while −7 is its two’s complement
(extending infinitely to the left); it is easy to verify that the ordinary procedure
for addition of binary numbers will give $-7 + 7 = (\ldots00000)_2 = 0$, when the procedure
is continued indefinitely. The values of $\frac{1}{7}$ and $-\frac{1}{7}$ are the unique 2-adic numbers that,
when formally multiplied by 7, give 1 and −1, respectively. The values of $\frac{7}{4}$ and $\frac{1}{10}$
are examples of 2-adic numbers that are not 2-adic “integers,” since they have nonzero
bits to the right of the binary point. The two values of $\sqrt{-7}$, which are negatives of
each other, are the only 2-adic numbers that, when formally squared, yield the value
$(\ldots111111111111001)_2$.

a) Prove that any 2-adic number u can be divided by any nonzero 2-adic number v
to obtain a unique 2-adic number w satisfying u = vw. (Hence the set of 2-adic
numbers forms a “field”; see Section 4.6.1.)

b) Prove that the 2-adic representation of the rational number −1/(2n + 1) may be
obtained as follows, when n is a positive integer: First find the ordinary binary
expansion of +1/(2n + 1), which has the periodic form $(0.\alpha\alpha\alpha\ldots)_2$ for some string
$\alpha$ of 0s and 1s. Then −1/(2n + 1) is the 2-adic number $(\ldots\alpha\alpha\alpha)_2$.

c) Prove that the representation of a 2-adic number u is ultimately periodic (that is,
$u_{N+\lambda} = u_N$ for all large N, for some $\lambda \ge 1$) if and only if u is rational (that is,
u = m/n, for some integers m and n).

d) Prove that, when n is an integer, $\sqrt{n}$ is a 2-adic number if and only if it satisfies
$n \bmod 2^{2k+3} = 2^{2k}$ for some nonnegative integer k. (Thus, the possibilities are
either n mod 8 = 1, or n mod 32 = 4, etc.)


32. [M40] (I. Z. Ruzsa.) Construct infinitely many integers whose ternary
representation uses only 0s and 1s and whose quinary representation uses only 0s, 1s, and 2s.

33. [M40] (D. A. Klarner.) Let D be any set of integers, let b be any positive integer,
and let $k_n$ be the number of distinct integers that can be written as n-digit numbers
$(a_{n-1} \ldots a_1 a_0)_b$ to base b with digits $a_j$ in D. Prove that the sequence $\langle k_n \rangle$ satisfies
a linear recurrence relation, and explain how to compute the generating function
$\sum_n k_n z^n$. Illustrate your algorithm by showing that $k_n$ is a Fibonacci number in
the case b = 3 and D = {−1, 0, 3}.

▶ 34. [22] (G. W. Reitwiesner, 1960.) Explain how to represent a given integer n in the
form $(\ldots a_2 a_1 a_0)_2$, where each $a_j$ is −1, 0, or 1, using the fewest nonzero digits.

4.2. FLOATING POINT ARITHMETIC


IN THIS SECTION we shall study the basic principles of arithmetic operations on 
“floating point” numbers, by analyzing the internal mechanisms underlying such 
calculations. Perhaps many readers will have little interest in this subject, since 
their computers either have built-in floating point instructions or their operating 
systems include suitable subroutines. But, in fact, the material of this section 
should not merely be the concern of computer-design engineers or of a small 
clique of people who write library subroutines for new machines; every well- 
rounded programmer ought to have a knowledge of what goes on during the ele- 
mentary steps of floating point arithmetic. This subject is not at all as trivial as 
most people think, and it involves a surprising amount of interesting information. 


4.2.1. Single-Precision Calculations 


A. Floating point notation. We have discussed “fixed point” notation for 
numbers in Section 4.1; in such a case the programmer knows where the radix 
point is assumed to lie in the numbers being manipulated. For many purposes, 
however, it is considerably more convenient to let the position of the radix point 
be dynamically variable or “floating” as a program is running, and to carry with 
each number an indication of its current radix point position. This idea has been 
used for many years in scientific calculations, especially for expressing very large 
numbers like Avogadro’s number $N = 6.02214 \times 10^{23}$, or very small numbers like
Planck’s constant $h = 6.6261 \times 10^{-27}$ erg sec.

In this section we shall work with base b, excess q, floating point numbers
with p digits: Such numbers will be represented by pairs of values (e, f), denoting

$$(e, f) = f \times b^{\,e-q}. \qquad (1)$$

Here e is an integer having a specified range, and f is a signed fraction. We will
adopt the convention that

$$|f| < 1;$$

in other words, the radix point appears at the left of the positional representation
of f. More precisely, the stipulation that we have p-digit numbers means that
$b^p f$ is an integer, and that

$$-b^p < b^p f < b^p. \qquad (2)$$

The term “floating binary” implies that b = 2, “floating decimal” implies b = 10, 

etc. Using excess-50 floating decimal numbers with 8 digits, we can write, for 
example, 

Avogadro’s number N = (74,+.60221400); 

Planck’s constant h = (24, +.66261000). (3) 
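The two examples in (3) can be checked against definition (1) with exact rational arithmetic. In the following sketch the function name `fp_value` and the choice to pass the fraction as the signed integer $b^p f$ are assumptions made for illustration.

```python
from fractions import Fraction


def fp_value(e, digits, b=10, q=50, p=8):
    """Value of the floating point number (e, f) under Eq. (1).

    `digits` is the signed integer b**p * f, so that f = digits / b**p
    and |digits| < b**p as required by Eq. (2).
    """
    if not abs(digits) < b**p:
        raise ValueError("fraction needs more than p digits")
    f = Fraction(digits, b**p)
    return f * Fraction(b) ** (e - q)


avogadro = fp_value(74, 60221400)    # (74, +.60221400)
planck = fp_value(24, 66261000)      # (24, +.66261000)
print(float(avogadro), float(planck))
```

With excess 50, the exponent part 74 stands for $10^{24}$ and the exponent part 24 stands for $10^{-26}$, which is why the fraction parts begin with the digits 6022... and 6626....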


The two components e and f of a floating point number are called the 
exponent and the fraction parts, respectively. (Other names are occasionally 
used for this purpose, notably “characteristic” and “mantissa”; but it is an abuse 
of terminology to call the fraction part a mantissa, since that term has quite a 
different meaning in connection with logarithms. Furthermore the English word 
mantissa means “a worthless addition.”)

The MIX computer assumes that its floating point numbers have the form

    ± | e | f | f | f | f |                                          (4)

Here we have base b, excess q, floating point notation with four bytes of precision,
where b is the byte size (e.g., b = 64 or b = 100), and q is equal to $\lfloor \frac12 b \rfloor$.
The fraction part is ± f f f f, and e is the exponent, which lies in the range
$0 \le e < b$. This internal representation is typical of the conventions in most
existing computers, although b is a much larger base than usual.


B. Normalized calculations. A floating point number (e, f) is normalized if
the most significant digit of the representation of f is nonzero, so that

$$1/b \le |f| < 1; \qquad (5)$$


or if f = 0 and e has its smallest possible value. It is possible to tell which of 
two normalized floating point numbers has a greater magnitude by comparing 
the exponent parts first, and then testing the fraction parts only if the exponents 
are equal. 
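The normalization condition (5) and the exponent-first comparison can be sketched as follows. This is an illustration only: it assumes the 8-digit excess-50 decimal format of (3), stores the fraction as the signed integer $b^p f$, and takes 0 as the smallest exponent; the function names are hypothetical.

```python
def is_normalized(e, digits, b=10, p=8, e_min=0):
    """Check condition (5); `digits` is the signed integer b**p * f."""
    if digits == 0:
        return e == e_min                   # zero must carry the smallest exponent
    return b**(p - 1) <= abs(digits) < b**p  # leading digit of f is nonzero


def magnitude_key(x):
    """Sort key for magnitudes of normalized numbers: exponent first, then |f|."""
    e, digits = x
    return (e, abs(digits))


one = (51, 10000000)             # +.10000000 x 10^(51-50) = 1.0
almost_one = (50, -99999999)     # -.99999999 x 10^(50-50)
print(max([one, almost_one], key=magnitude_key))   # (51, 10000000)
```

The tuple comparison works only because both inputs are normalized; an unnormalized number such as (52, 01000000) has the same value as `one` but a larger exponent.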

Most floating point routines now in use deal almost entirely with normalized 
numbers: Inputs to the routines are assumed to be normalized, and the outputs 
are always normalized. Under these conventions we lose the ability to represent 
a few numbers of very small magnitude — for example, the value (0, .00000001) 
can’t be normalized without producing a negative exponent — but we gain in 
speed, uniformity, and the ability to give relatively simple bounds on the relative 
error in our computations. (Unnormalized floating point arithmetic is discussed 
in Section 4.2.2.) 

Let us now study the normalized floating point operations in detail. At the 
same time we can consider the construction of subroutines for these operations, 
assuming that we have a computer without built-in floating point hardware. 

Machine-language subroutines for floating point arithmetic are usually writ- 
ten in a very machine-dependent manner, using many of the wildest idiosyn- 
crasies of the computer at hand. Therefore floating point addition subroutines 
for two different machines usually bear little superficial resemblance to each 
other. Yet a careful study of numerous subroutines for both binary and decimal 
computers reveals that these programs actually have quite a lot in common, and 
it is possible to discuss the topics in a machine-independent way. 

The first (and by far the most difficult!) algorithm we shall discuss in this 
section is a procedure for floating point addition, 


$$(e_u, f_u) \oplus (e_v, f_v) = (e_w, f_w). \qquad (6)$$

Since floating point arithmetic is inherently approximate, not exact, we will use
“round” symbols

$$\oplus, \quad \ominus, \quad \otimes, \quad \oslash$$

to stand for floating point addition, subtraction, multiplication, and division,
respectively, in order to distinguish approximate operations from the true ones.

Fig. 2. Floating point addition.

The basic idea involved in floating point addition is fairly simple: Assuming
that $e_u \ge e_v$, we take $e_w = e_u$, $f_w = f_u + f_v/b^{\,e_u - e_v}$ (thereby aligning the
radix points for a meaningful addition), and normalize the result. But several
situations can arise that make this process nontrivial, and the following algorithm
explains the method more precisely.


Algorithm A (Floating point addition). Given base b, excess q, p-digit,
normalized floating point numbers $u = (e_u, f_u)$ and $v = (e_v, f_v)$, this algorithm
forms the sum $w = u \oplus v$. The same procedure may be used for floating point
subtraction, if −v is substituted for v.

A1. [Unpack.] Separate the exponent and fraction parts of the representations
    of u and v.

A2. [Assume $e_u \ge e_v$.] If $e_u < e_v$, interchange u and v. (In many cases, it is
    best to combine step A2 with step A1 or with some of the later steps.)

A3. [Set $e_w$.] Set $e_w \leftarrow e_u$.

A4. [Test $e_u - e_v$.] If $e_u - e_v \ge p + 2$ (large difference in exponents), set
    $f_w \leftarrow f_u$ and go to step A7. (Actually, since we are assuming that u is
    normalized, we could terminate the algorithm; but it is occasionally useful
    to be able to normalize a possibly unnormalized number by adding zero to it.)

A5. [Scale right.] Shift $f_v$ to the right $e_u - e_v$ places; that is, divide it by
    $b^{\,e_u - e_v}$. [Note: This will be a shift of up to p + 1 places, and the next
    step (which adds $f_u$ to $f_v$) thereby requires an accumulator capable of
    holding 2p + 1 base-b digits to the right of the radix point. If such a large
    accumulator is not available, it is possible to shorten the requirement to
    p + 2 or p + 3 places if proper precautions are taken; the details are given
    in exercise 5.]

A6. [Add.] Set $f_w \leftarrow f_u + f_v$.

A7. [Normalize.] (At this point $(e_w, f_w)$ represents the sum of u and v, but $|f_w|$
    may have more than p digits, and it may be greater than unity or less than
    1/b.) Perform Algorithm N below, to normalize and round $(e_w, f_w)$ into the
    final answer. ▮

Fig. 3. Normalization of (e, f).


Algorithm N (Normalization). A “raw exponent” e and a “raw fraction” f are
converted to normalized form, rounding if necessary to p digits. This algorithm
assumes that $|f| < b$.

N1. [Test f.] If $|f| \ge 1$ (“fraction overflow”), go to step N4. If f = 0, set e to
    its lowest possible value and go to step N7.

N2. [Is f normalized?] If $|f| \ge 1/b$, go to step N5.

N3. [Scale left.] Shift f to the left by one digit position (that is, multiply it
    by b), and decrease e by 1. Return to step N2.

N4. [Scale right.] Shift f to the right by one digit position (that is, divide it
    by b), and increase e by 1.

N5. [Round.] Round f to p places. (We take this to mean that f is changed to
    the nearest multiple of $b^{-p}$. It is possible that $(b^p f) \bmod 1 = \frac12$ so that there
    are two nearest multiples; if b is even, we change f to the nearest multiple
    $f'$ of $b^{-p}$ such that $b^p f' + \frac12 b$ is odd. Further discussion of rounding appears
    in Section 4.2.2.) It is important to note that this rounding operation can
    make $|f| = 1$ (“rounding overflow”); in such a case, return to step N4.

N6. [Check e.] If e is too large, that is, greater than its allowed range, an
    exponent overflow condition is sensed. If e is too small, an exponent
    underflow condition is sensed. (See the discussion below; since the result cannot
    be expressed as a normalized floating point number in the required range,
    special action is necessary.)

N7. [Pack.] Put e and f together into the desired output representation. ▮
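The steps of Algorithms A and N can be traced in a short Python sketch. It is not a rendering of the MIX program in this section: it assumes the 8-digit excess-50 decimal format of (3), assumes an exponent range $0 \le e < 100$, keeps fractions as exact `Fraction` values (so the 2p + 1 digit accumulator of step A5 is implicit), and reports overflow and underflow by raising exceptions. All names are hypothetical.

```python
from fractions import Fraction

B, P, Q = 10, 8, 50      # base, precision, excess: assumed, as in Eq. (3)
E_LIMIT = 100            # assumed exponent range 0 <= e < E_LIMIT


def round_to_p(f):
    """Step N5: nearest multiple of B**-P; on a tie choose f' with B**P*f' + B/2 odd."""
    scaled = f * B**P
    lo = scaled.numerator // scaled.denominator        # floor of B**P * f
    rem = scaled - lo
    if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and (lo + B // 2) % 2 == 0):
        lo += 1
    return Fraction(lo, B**P)


def normalize(e, f):
    """Algorithm N for an exact raw fraction f with |f| < B."""
    f = Fraction(f)
    if f == 0:                                # N1: zero gets the lowest exponent
        return (0, Fraction(0))
    while abs(f) < Fraction(1, B):            # N2, N3: scale left
        f *= B
        e -= 1
    while True:
        if abs(f) >= 1:                       # N4: fraction or rounding overflow
            f /= B
            e += 1
        f = round_to_p(f)                     # N5
        if abs(f) < 1:
            break
    if e >= E_LIMIT:                          # N6
        raise OverflowError("exponent overflow")
    if e < 0:
        raise ArithmeticError("exponent underflow")
    return (e, f)                             # N7: the "packed" result is a pair


def fadd(u, v):
    """Algorithm A: u and v are normalized pairs (e, f)."""
    (eu, fu), (ev, fv) = u, v                 # A1
    if eu < ev:                               # A2
        (eu, fu), (ev, fv) = (ev, fv), (eu, fu)
    ew = eu                                   # A3
    if eu - ev >= P + 2:                      # A4
        fw = Fraction(fu)
    else:                                     # A5, A6
        fw = Fraction(fu) + Fraction(fv) / B**(eu - ev)
    return normalize(ew, fw)                  # A7


# .99999999 + .000000005 is a tie that rounds up and causes rounding overflow.
print(fadd((50, Fraction(99999999, 10**8)), (42, Fraction(1, 2))))
```

The last line exercises the path N5 → N4 → N5: the tie is resolved upward to a fraction of exactly 1, which is then scaled right to give (51, 1/10).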


Some simple examples of floating point addition are given in exercise 4.

The following MIX subroutines, for addition and subtraction of numbers
having the form (4), show how Algorithms A and N can be expressed as computer
programs. The subroutines below are designed to take one input u from symbolic
location ACC, and the other input v comes from register A upon entrance to the
subroutine. The output w appears both in register A and location ACC. Thus, a
fixed point coding sequence

    LDA A; ADD B; SUB C; STA D                                       (7)

would correspond to the floating point coding sequence

    LDA A, STA ACC; LDA B, JMP FADD; LDA C, JMP FSUB; STA D.         (8)


Program A (Addition, subtraction, and normalization). The following program
is a subroutine for Algorithm A, and it is also designed so that the normalization
portion can be used by other subroutines that appear later in this section. In
this program and in many others throughout this chapter, OFLO stands for a
subroutine that prints out a message to the effect that MIX’s overflow toggle
was unexpectedly found to be on. The byte size b is assumed to be a multiple
of 4. The normalization routine NORM assumes that rI2 = e and rAX = f, where
rA = 0 implies rX = 0 and rI2 < b.


00  BYTE   EQU   1(4:4)          Byte size b
01  EXP    EQU   1:1             Definition of exponent field
02  FSUB   STA   TEMP            Floating point subtraction subroutine:
03         LDAN  TEMP            Change sign of operand.
04  FADD   STJ   EXITF           Floating point addition subroutine:
05         JOV   OFLO            Ensure that overflow is off.
06         STA   TEMP            TEMP ← v.
07         LDX   ACC             rX ← u.
08         CMPA  ACC(EXP)        Steps A1, A2, A3 are combined here:
09         JGE   1F              Jump if e_v ≥ e_u.
10         STX   FU(0:4)         FU ← ± f f f f 0.
11         LD2   ACC(EXP)        rI2 ← e_w.
12         STA   FV(0:4)
13         LD1N  TEMP(EXP)       rI1 ← −e_v.
14         JMP   4F
15  1H     STA   FU(0:4)         FU ← ± f f f f 0 (u, v interchanged).
16         LD2   TEMP(EXP)       rI2 ← e_w.
17         STX   FV(0:4)
18         LD1N  ACC(EXP)        rI1 ← −e_v.
19  4H     INC1  0,2             rI1 ← e_u − e_v. (Step A4 unnecessary.)
20  5H     LDA   FV              A5. Scale right.
21         ENTX  0               Clear rX.
22         SRAX  0,1             Shift right e_u − e_v places.
23  6H     ADD   FU              A6. Add.
24         JOV   N4              A7. Normalize. Jump if fraction overflow.
25         JXZ   NORM            Easy case?
26         LD1   FV(0:1)         Check for opposite signs.
27         JAP   1F
28         J1N   N2              If not, normalize.
29         JMP   2F
30  1H     J1P   N2
31  2H     SRC   5               |rX| ↔ |rA|.
32         DECX  1               (rX is positive.)
33         STA   TEMP            (The operands had opposite signs;
34         STA   HALF(0:0)        we must adjust the registers
35         LDAN  TEMP             before rounding and normalization.)
36         ADD   HALF
37         ADD   HALF            Complement the least significant portion.
38         SRC   5
39         JMP   N2              Jump into normalization routine.
40  HALF   CON   1//2            One half the word size (Sign varies)
41  FU     CON   0               Fraction part f_u
42  FV     CON   0               Fraction part f_v
43  NORM   JAZ   ZRO             N1. Test f.
44  N2     CMPA  =0=(1:1)        N2. Is f normalized?
45         JNE   N5              To N5 if leading byte nonzero.
46  N3     SLAX  1               N3. Scale left.
47         DEC2  1               Decrease e by 1.
48         JMP   N2              Return to N2.
49  N4     ENTX  1               N4. Scale right.
50         SRC   1               Shift right, insert “1” with proper sign.
51         INC2  1               Increase e by 1.
52  N5     CMPA  =BYTE/2=(5:5)   N5. Round.
53         JL    N6              Is |tail| < ½b?
54         JG    5F              Is |tail| > ½b?
55         JXNZ  5F
56         STA   TEMP            |tail| = ½b; round to odd.
57         LDX   TEMP(4:4)
58         JXO   N6              To N6 if rX is odd.
59  5H     STA   *+1(0:0)        Store sign of rA.
60         INCA  BYTE            Add b^−4 to |f|. (Sign varies)
61         JOV   N4              Check for rounding overflow.
62  N6     J2N   EXPUN           N6. Check e. Underflow if e < 0.
63  N7     ENTX  0,2             N7. Pack. rX ← e.
64         SRC   1
65  ZRO    DEC2  BYTE            rI2 ← e − b.
66  8H     STA   ACC
67  EXITF  J2N   *               Exit, unless e ≥ b.
68  EXPOV  HLT   2               Exponent overflow detected
69  EXPUN  HLT   1               Exponent underflow detected
70  ACC    CON   0               Floating point accumulator ▮


The rather long section of code from lines 26 to 40 is needed because MIX has 
only a 5-byte accumulator for adding signed numbers while in general 2p+1 = 9 
places of accuracy are required by Algorithm A. The program could be shortened 
to about half its present length if we were willing to sacrifice a little bit of its 
accuracy, but we shall see in the next section that full accuracy is important. 
Line 58 uses a nonstandard MIX instruction defined in Section 4.5.2. The running
time for floating point addition and subtraction depends on several factors that
are analyzed in Section 4.2.4.

Now let us consider multiplication and division, which are simpler than 
addition, and somewhat similar to each other. 


Algorithm M (Floating point multiplication or division). Given base b, excess q,
p-digit, normalized floating point numbers $u = (e_u, f_u)$ and $v = (e_v, f_v)$, this
algorithm forms the product $w = u \otimes v$ or the quotient $w = u \oslash v$.

M1. [Unpack.] Separate the exponent and fraction parts of the representations
    of u and v. (Sometimes it is convenient, but not necessary, to test the
    operands for zero during this step.)

M2. [Operate.] Set

    $$\begin{aligned}
    e_w &\leftarrow e_u + e_v - q, & f_w &\leftarrow f_u f_v && \text{for multiplication;} \\
    e_w &\leftarrow e_u - e_v + q + 1, & f_w &\leftarrow (b^{-1} f_u)/f_v && \text{for division.}
    \end{aligned} \qquad (9)$$

    (Since the input numbers are assumed to be normalized, it follows that
    either $f_w = 0$, or $1/b^2 \le |f_w| < 1$, or a division-by-zero error has occurred.)
    If necessary, the representation of $f_w$ may be reduced to p + 2 or p + 3 digits
    at this point, as in exercise 5.

M3. [Normalize.] Perform Algorithm N on $(e_w, f_w)$ to normalize, round, and
    pack the result. (Note: Normalization is simpler in this case, since scaling
    left occurs at most once, and since rounding overflow cannot occur after
    division.) ▮
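Step M2 can be illustrated in the same style. The sketch below again assumes the 8-digit excess-50 decimal format with exponent range $0 \le e < 100$ and exact `Fraction` fractions; `normalize` is a compact restatement of Algorithm N included only to make the example self-contained, and all names are hypothetical.

```python
from fractions import Fraction

B, P, Q, E_LIMIT = 10, 8, 50, 100     # assumed format parameters


def normalize(e, f):
    """Compact Algorithm N (normalize, round to P places, check exponent)."""
    f = Fraction(f)
    if f == 0:
        return (0, Fraction(0))
    while abs(f) < Fraction(1, B):                      # N2, N3
        f, e = f * B, e - 1
    while True:
        if abs(f) >= 1:                                 # N4
            f, e = f / B, e + 1
        scaled = f * B**P                               # N5
        lo = scaled.numerator // scaled.denominator
        rem = scaled - lo
        if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and (lo + B // 2) % 2 == 0):
            lo += 1
        f = Fraction(lo, B**P)
        if abs(f) < 1:
            break
    if e >= E_LIMIT:                                    # N6
        raise OverflowError("exponent overflow")
    if e < 0:
        raise ArithmeticError("exponent underflow")
    return (e, f)


def fmul(u, v):
    """Eq. (9), multiplication: e_w = e_u + e_v - q, f_w = f_u * f_v."""
    (eu, fu), (ev, fv) = u, v
    return normalize(eu + ev - Q, Fraction(fu) * Fraction(fv))


def fdiv(u, v):
    """Eq. (9), division: e_w = e_u - e_v + q + 1, f_w = (f_u / b) / f_v."""
    (eu, fu), (ev, fv) = u, v
    if fv == 0:
        raise ZeroDivisionError("floating point division by zero")
    return normalize(eu - ev + Q + 1, Fraction(fu) / B / Fraction(fv))


one, three = (51, Fraction(1, 10)), (51, Fraction(3, 10))
print(fdiv(one, three))    # (50, Fraction(33333333, 100000000))
```

Dividing $f_u$ by b before the division keeps the raw quotient below 1 in magnitude, which is why at most one left shift is needed in the subsequent normalization.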


The following MIX subroutines, intended to be used in connection with 
Program A, illustrate the machine considerations that arise in Algorithm M. 


Program M (Floating point multiplication and division). 


01  Q      EQU   BYTE/2          q is half the byte size
02  FMUL   STJ   EXITF           Floating point multiplication subroutine:
03         JOV   OFLO            Ensure that overflow is off.
04         STA   TEMP            TEMP ← v.
05         LDX   ACC             rX ← u.
06         STX   FU(0:4)         FU ← ± f f f f 0.
07         LD1   TEMP(EXP)
08         LD2   ACC(EXP)
09         INC2  -Q,1            rI2 ← e_u + e_v − q.
10         SLA   1
11         MUL   FU              Multiply f_u times f_v.
12         JMP   NORM            Normalize, round, and exit.
13  FDIV   STJ   EXITF           Floating point division subroutine:
14         JOV   OFLO            Ensure that overflow is off.
15         STA   TEMP            TEMP ← v.
16         STA   FV(0:4)         FV ← ± f f f f 0.
17         LD1   TEMP(EXP)
18         LD2   ACC(EXP)
19         DEC2  -Q,1            rI2 ← e_u − e_v + q.
20         ENTX  0
21         LDA   ACC
22         SLA   1               rA ← f_u.
23         CMPA  FV(1:5)
24         JL    *+3             Jump if |f_u| < |f_v|.
25         SRA   1               Otherwise, scale f_u right
26         INC2  1               and increase rI2 by 1.
27         DIV   FV              Divide.
28         JNOV  NORM            Normalize, round, and exit.
29  DVZRO  HLT   3               Unnormalized or zero divisor ▮


The most noteworthy feature of this program is the provision for division
in lines 23-26, which is made in order to ensure enough accuracy to round the
answer. If $|f_u| < |f_v|$, straightforward application of Algorithm M would leave
a result of the form “±0 f f f f” in register A, and this would not allow a
proper rounding without a careful analysis of the remainder (which appears in
register X). So the program computes $f_w \leftarrow f_u/f_v$ in this case, ensuring that $f_w$
is either zero or normalized in all cases; rounding can proceed with five significant
bytes, possibly testing whether the remainder is zero.


We occasionally need to convert values between fixed and floating point 
representations. A “fix-to-float” routine is easily obtained with the help of the 
normalization algorithm above; for example, in MIX, the following subroutine 
converts an integer to floating point form: 


01  FLOT   STJ   EXITF           Assume that rA = u, an integer.
02         JOV   OFLO            Ensure that overflow is off.
03         ENT2  Q+5             Set raw exponent.                    (10)
04         ENTX  0
05         JMP   NORM            Normalize, round, and exit. ▮


A “float-to-fix” subroutine is the subject of exercise 14. 
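
The fix-to-float conversion amounts to nothing more than normalizing and rounding. As a rough illustration, here is a Python sketch in eight-digit floating decimal arithmetic; it assumes excess-zero exponents (unlike MIX's excess-q form), relies on the standard `decimal` module for the rounding, and the name `flot` is simply borrowed from the subroutine above.

```python
from decimal import Context, Decimal

CTX = Context(prec=8)  # eight significant digits, round-half-even by default

def flot(u):
    """Convert the integer u to (e, f) with u ~ f * 10**e and 0.1 <= |f| < 1."""
    x = CTX.plus(Decimal(u))     # unary plus rounds to eight significant digits
    if x == 0:
        return 0, Decimal(0)
    e = x.adjusted() + 1         # number of digits to the left of the radix point
    return e, x.scaleb(-e)       # fraction part f

e, f = flot(123456789)           # nine digits, so the last one is rounded away
```

Note that `flot(999999999)` yields exponent 10 rather than 9: the carry produced by rounding propagates past the leading digit.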


The debugging of floating point subroutines is usually a difficult job, since 
there are so many cases to consider. Here is a list of common pitfalls that often 
trap a programmer or machine designer who is preparing floating point routines: 


1) Losing the sign. On many machines (not MIX), shift instructions between 
registers will affect the sign, and the shifting operations used in normalizing and 
scaling numbers must be carefully analyzed. The sign is also lost frequently 
when minus zero is present. (For example, Program A is careful to retain the 
sign of register A in lines 33-37. See also exercise 6.) 


2) Failure to treat exponent underflow or overflow properly. The size of $e_w$
should not be checked until after the rounding and normalization, because 
preliminary tests may give an erroneous indication. Exponent underflow and 
overflow can occur on floating point addition and subtraction, not only during 
multiplication and division; and even though this is a rather rare occurrence, it 
must be tested each time. Enough information should be retained so that mean- 
ingful corrective actions are possible after overflow or underflow has occurred. 



It has unfortunately become customary in many instances to ignore exponent 
underflow and simply to set underflowed results to zero with no indication of 
error. This causes a serious loss of accuracy in most cases (indeed, it is the 
loss of all the significant digits), and the assumptions underlying floating point 
arithmetic have broken down; so the programmer really must be told when 
underflow has occurred. Setting the result to zero is appropriate only in certain 
cases when the result is later to be added to a significantly larger quantity. 
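When exponent underflow is not detected, we find mysterious situations in which
$(u \otimes v) \otimes w$ is zero, but $u \otimes (v \otimes w)$ is not, since $u \otimes v$ results in exponent underflow
but $u \otimes (v \otimes w)$ can be calculated without any exponents falling out of range.
Similarly, we can find positive numbers a, b, c, d, and y such that

$$\begin{aligned}
(a \otimes y \oplus b) \oslash (c \otimes y \oplus d) &\approx \tfrac{2}{3},\\
(a \oplus b \oslash y) \oslash (c \oplus d \oslash y) &= 1
\end{aligned} \tag{11}$$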


if exponent underflow is not detected. (See exercise 9.) Even though floating 
point routines are not precisely accurate, such a disparity as (11) is certainly 
unexpected when a, b, c, d, and y are all positive! Exponent underflow is usually 
not anticipated by a programmer, so it needs to be reported.* 
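
The anomalies just described are easy to reproduce. The sketch below simulates eight-digit floating decimal arithmetic with the exponent range −50 ≤ e < 50 and silently flushes underflowed results to zero; the helper names (`flush`, `fadd`, `fmul`, `fdiv`) and the particular operands are assumptions chosen for illustration, one possible choice among many.

```python
from decimal import Context, Decimal

CTX = Context(prec=8)    # eight-digit floating decimal
E_MIN = -50              # smallest exponent e in f * 10**e, with 0.1 <= |f| < 1

def flush(x):
    """Silently replace an underflowed result by zero, with no error indication."""
    if x != 0 and x.adjusted() + 1 < E_MIN:
        return Decimal(0)
    return x

def fadd(u, v): return flush(CTX.add(u, v))
def fmul(u, v): return flush(CTX.multiply(u, v))
def fdiv(u, v): return flush(CTX.divide(u, v))

# Associativity of multiplication is destroyed outright.
u, v, w = Decimal("1E-30"), Decimal("1E-30"), Decimal("1E30")
left = fmul(fmul(u, v), w)      # u*v underflows, so the product is 0
right = fmul(u, fmul(v, w))     # no exponent leaves the range

# Two algebraically equal quotients, as in (11).
a = c = Decimal("1E-51")
b, d, y = Decimal("2E-42"), Decimal("8E-42"), Decimal("1E10")
first = fdiv(fadd(fmul(a, y), b), fadd(fmul(c, y), d))
second = fdiv(fadd(a, fdiv(b, y)), fadd(c, fdiv(d, y)))   # b/y and d/y vanish
```

Here `left` is zero while `right` is 1E-30, and `first` is 0.66666667 while `second` is exactly 1, although every operand is positive and representable.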


3) Inserted garbage. When scaling to the left it is important to keep from 
introducing anything but zeros at the right. For example, note the ‘ENTX 0’ 
instruction in line 21 of Program A, and the all-too-easily-forgotten ‘ENTX 0’ 
instruction in line 04 of the FLOT subroutine (10). (But it would be a mistake to 
clear register X after line 27 in the division subroutine.) 


4) Unforeseen rounding overflow. When a number like .999999997 is rounded 
to 8 digits, a carry will occur to the left of the decimal point, and the result must 
be scaled to the right. Many people have mistakenly concluded that rounding 
overflow is impossible during multiplication, since they look at the maximum 
value of $|f_u f_v|$, which is $1 - 2b^{-p} + b^{-2p}$; and this cannot round up to 1. The
fallacy in this reasoning is exhibited in exercise 11. Curiously, it turns out that 
the phenomenon of rounding overflow is impossible during floating point division 
(see exercise 12). 


* On the other hand, we must admit that today’s high-level programming languages give the 
programmer little or no satisfactory way to make use of the information that a floating point 
routine wants to provide; and the MIX programs in this section, which simply halt when errors 
are detected, are even worse. There are numerous important applications in which exponent 
underflow is relatively harmless, and it is desirable to find a way for programmers to cope 
with such situations easily and safely. The practice of silently replacing underflows by zero has 
been thoroughly discredited, but there is another alternative that has recently been gaining 
much favor, namely to modify the definition that we have given for floating point numbers, 
allowing an unnormalized fraction part when the exponent has its smallest possible value. This 
idea of “gradual underflow,” which was first embodied in the hardware of the Electrologica X8 
computer, adds only a small amount of complexity to the algorithms, and it makes exponent 
underflow impossible during addition or subtraction. The simple formulas for relative error in 
Section 4.2.2 no longer hold in the presence of gradual underflow, so the topic is beyond the 
scope of this book. However, by using formulas like $\mathrm{round}(x) = x(1-\delta)+\epsilon$, where $|\delta| < b^{1-p}/2$
and $|\epsilon| < b^{-p-q}/2$, one can show that gradual underflow succeeds in many important cases.
See W. M. Kahan and J. Palmer, ACM SIGNUM Newsletter (October 1979), 13-21.

There is a school of thought that says it is harmless to “round” a value like
.999999997 to .99999999 instead of to 1.0000000, since this does not increase
the worst-case bounds on relative error. The floating decimal number 1.0000000
may be said to represent all real values in the interval

$$[\,1.0000000 - 5 \times 10^{-8} \;..\; 1.0000000 + 5 \times 10^{-8}\,],$$

while .99999999 represents all values in the much smaller interval

$$(\,.99999999 - 5 \times 10^{-9} \;..\; .99999999 + 5 \times 10^{-9}\,).$$


Even though the latter interval does not contain the original value .999999997, 
each number of the second interval is contained in the first, so subsequent 
calculations with the second interval are no less accurate than with the first. This 
ingenious argument is, however, incompatible with the mathematical philosophy 
of floating point arithmetic expressed in Section 4.2.2. 
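
Rounding overflow can be observed directly in any correctly rounded decimal arithmetic. The following Python fragment is only an illustration using the standard `decimal` module, not part of the algorithms of this section; it rounds .999999997 to eight digits and also shows the truncated alternative favored by the school of thought just mentioned.

```python
from decimal import Context, Decimal, ROUND_DOWN

x = Decimal("0.999999997")

# Proper rounding to eight digits: the carry runs past the leading digit,
# so the result must be scaled right and the exponent increased by one.
rounded = Context(prec=8).plus(x)

# The alternative of simply dropping the ninth digit.
chopped = Context(prec=8, rounding=ROUND_DOWN).plus(x)

exponent_before = x.adjusted()        # exponent of the leading digit of x
exponent_after = rounded.adjusted()   # one larger after rounding overflow
```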


5) Rounding before normalizing. Inaccuracies are caused by premature round- 
ing in the wrong digit position. This error is obvious when rounding is being done 
to the left of the appropriate position; but it is also dangerous in the less obvious 
cases where rounding is first done too far to the right, followed by rounding in the 
true position. For this reason it is a mistake to round during the “scaling-right” 
operation in step A5, except as prescribed in exercise 5. (The special case of 
rounding in step N5, then rounding again after rounding overflow has occurred, 
is harmless, however, because rounding overflow always yields +1.0000000 and 
such values are unaffected by the subsequent rounding process.) 
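
A small experiment shows why rounding too far to the right and then rounding again is dangerous. The sketch below uses Python's `decimal` module with its default round-half-even rule; the operand is an arbitrary value chosen to expose the effect.

```python
from decimal import Context, Decimal

x = Decimal("1.234567749")

# Round once, in the true position (eight digits).
direct = Context(prec=8).plus(x)

# Round prematurely to nine digits, then again to eight.
premature = Context(prec=9).plus(x)     # 1.23456775: now an exact tie
twice = Context(prec=8).plus(premature)
```

The first rounding manufactures a tie that was not present in x, and the second rounding then resolves that tie upward, so `twice` exceeds `direct` by one unit in the last place.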


6) Failure to retain enough precision in intermediate calculations. Detailed 
analyses of the accuracy of floating point arithmetic, made in the next section, 
suggest strongly that normalizing floating point routines should always deliver 
a properly rounded result to the maximum possible accuracy. There should 
be no exceptions to this dictum, even in cases that occur with extremely low 
probability; the appropriate number of significant digits should be retained 
throughout the computations, as stated in Algorithms A and M. 


C. Floating point hardware. Nearly every large computer intended for 
scientific calculations includes floating point arithmetic as part of its repertoire of 
built-in operations. Unfortunately, the design of such hardware usually includes 
some anomalies that result in dismally poor behavior in certain circumstances, 
and we hope that future computer designers will pay more attention to providing 
the proper behavior than they have in the past. It costs only a little more 
to build the machine right, and considerations in the following section show 
that substantial benefits will be gained. Yesterday’s compromises are no longer 
appropriate for modern machines, based on what we know now. 

The MIX computer, which is being used as an example of a “typical” machine 
in this series of books, has an optional “floating point attachment” (available at 
extra cost) that includes the following seven operations: 


• FADD, FSUB, FMUL, FDIV, FLOT, FCMP (C = 1, 2, 3, 4, 5, 56, respectively; F = 6).
The contents of rA after the operation ‘FADD V’ are precisely the same as the
contents of rA after the operations

STA ACC; LDA V; JMP FADD


where FADD is the subroutine that appears earlier in this section, except that both 
operands are automatically normalized before entry to the subroutine if they 
were not already in normalized form. (If exponent underflow occurs during this 
pre-normalization, but not during the normalization of the answer, no underflow 
is signalled.) Similar remarks apply to FSUB, FMUL, and FDIV. The contents of 
rA after the operation ‘FLOT’ are the contents after ‘JMP FLOT’ in the subroutine 
(10) above. 

The contents of rA are unchanged by the operation ‘FCMP V’. This instruc- 
tion sets the comparison indicator to LESS, EQUAL, or GREATER, depending on 
whether the contents of rA are “definitely less than,” “approximately equal to,” 
or “definitely greater than” V, as discussed in the next section. The precise 
action is defined by the subroutine FCMP of exercise 4.2.2-17 with EPSILON in 
location 0. 

No register other than rA is affected by any of the floating point operations. 
If exponent overflow or underflow occurs, the overflow toggle is turned on and 
the exponent of the answer is given modulo the byte size. Division by zero leaves 
undefined garbage in rA. Execution times: 4u, 4u, 9u, 11u, 3u, 4u, respectively. 


• FIX (C = 5; F = 7). The contents of rA are replaced by the integer “round(rA)”,
rounding to the nearest integer as in step N5 of Algorithm N. However, if this 
answer is too large to fit in the register, the overflow toggle is set on and the 
result is undefined. Execution time: 3u. 


Sometimes it is helpful to use floating point operators in a nonstandard 
way. For example, if the operation FLOT had not been included as part of MIX’s 
floating point attachment, we could easily achieve its effect on 4-byte numbers 
by writing 


FLOT   STJ   9F
       SLA   1
       ENTX  Q+4
       SRC   1                    (12)
       FADD  =0=
9H     JMP   *     ∎


This routine is not strictly equivalent to the FLOT operator, since it assumes that 
the 1:1 byte of rA is zero, and it destroys rX. The handling of more general 
situations is a little tricky, because rounding overflow can occur even during a 
FLOT operation. 

Similarly, suppose MIX had a FADD operation but not FIX. If we wanted to 
round a number u from floating point form to the nearest fixed point integer, 
and if we knew that the number was nonnegative and would fit in at most three 
bytes, we could write 


FADD FUDGE

where location FUDGE contains the constant

+ | Q+4 | 1 | 0 | 0 | 0 |;


the result in rA would be 


+ | ▒ | 1 | round(u) |.                    (13)
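
The same trick works in any radix. Here is a decimal analog, sketched in Python under the assumption of eight-digit floating decimal arithmetic: the constant 10^7 plays the role of FUDGE, and u is assumed to be nonnegative and less than 10^7. The function name `round_to_integer` is hypothetical.

```python
from decimal import Context, Decimal

CTX = Context(prec=8)              # eight-digit floating decimal
FUDGE = Decimal(10) ** 7           # decimal analog of the constant in location FUDGE

def round_to_integer(u):
    """Round 0 <= u < 10**7 to the nearest integer with one floating point addition."""
    s = CTX.add(u, FUDGE)          # the sum has the form 1ddddddd, with no fraction digits
    return int(s) - 10 ** 7        # strip the leading 1 contributed by FUDGE

nearest = round_to_integer(Decimal("12.6"))
```

Adding FUDGE forces the unit position of u into the last place of the eight-digit result, so the rounding performed by the addition itself is exactly the rounding to an integer that we want.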


D. History and bibliography. The origins of floating point notation can 
be traced back to Babylonian mathematicians (1800 B.C. or earlier), who made 
extensive use of radix-60 floating point arithmetic but did not have a notation for 
the exponents. The appropriate exponent was always somehow “understood” by 
whoever was doing the calculations. At least one case has been found in which 
the wrong answer was given because addition was performed with improper 
alignment of the operands, but such examples are very rare; see O. Neugebauer, 
The Exact Sciences in Antiquity (Princeton, N. J.: Princeton University Press, 
1952), 26-27. Another early contribution to floating point notation is due to 
the Greek mathematician Apollonius (3rd century B.C.), who apparently was 
the first to explain how to simplify multiplication by collecting powers of 10 
separately from their coefficients, at least in simple cases. [For a discussion of 
Apollonius’s method, see Pappus, Mathematical Collections (4th century A.D.).] 
After the Babylonian civilization died out, the first significant uses of floating 
point notation for products and quotients did not emerge until much later, about 
the time logarithms were invented (1600) and shortly afterwards when Oughtred 
invented the slide rule (1630). The modern notation “$x^n$” for exponents was
being introduced at about the same time; separate symbols for x squared, x 
cubed, etc., had been in use before this. 

Floating point arithmetic was incorporated into the design of some of the ear- 
liest computers. It was independently proposed by Leonardo Torres y Quevedo 
in Madrid, 1914; by Konrad Zuse in Berlin, 1936; and by George Stibitz in 
New Jersey, 1939. Zuse’s machines used a floating binary representation that he 
called “semi-logarithmic notation”; he also incorporated conventions for dealing 
with special quantities like “∞” and “undefined.” The first American computers
to operate with floating point arithmetic hardware were the Bell Laboratories’ 
Model V and the Harvard Mark II, both of which were relay calculators designed 
in 1944. [See B. Randell, The Origins of Digital Computers (Berlin: Springer, 
1973), 100, 155, 163-164, 259-260; Proc. Symp. Large-Scale Digital Calculating 
Machinery (Harvard, 1947), 41-68, 69-79; Datamation 13 (April 1967), 35-44 
(May 1967), 45-49; Zeit. für angew. Math. und Physik 1 (1950), 345-346.] 

The use of floating binary arithmetic was seriously considered in 1944-1946 
by researchers at the Moore School in their plans for the first electronic digital 
computers, but they found that floating point circuitry was much harder to 
implement with tubes than with relays. The group realized that scaling was a 
problem in programming; but they knew that it was only a very small part of a 
total programming job in those days. Indeed, explicit fixed-point scaling seemed 
to be well worth the time and trouble it took, since it tended to keep programmers
aware of the numerical accuracy they were getting. Furthermore, the machine designers
argued that floating point representation would consume valuable memory
space, since the exponents must be stored; and they noted that floating point
hardware was not readily adapted to multiple-precision calculations. [See von 
Neumann’s Collected Works 5 (New York: Macmillan, 1963), 43, 73-74.] At that 
time, of course, they were designing the first stored-program computer and the 
second electronic computer, and their choice had to be either fixed point or float- 
ing point arithmetic, not both. They anticipated the coding of floating binary 
subroutines, and in fact “shift left” and “shift right” instructions were put into 
their design primarily to make such routines more efficient. The first machine to 
have both kinds of arithmetic in its hardware was apparently a computer devel- 
oped at General Electric Company [see Proc. 2nd Symp. Large-Scale Digital Cal- 
culating Machinery (Cambridge, Mass.: Harvard University Press, 1951), 65-69]. 

Floating point subroutines and interpretive systems for early machines were 
coded by D. J. Wheeler and others, and the first publication of such routines 
was in The Preparation of Programs for an Electronic Digital Computer by 
Wilkes, Wheeler, and Gill (Reading, Mass.: Addison-Wesley, 1951), subroutines 
A1-A11, pages 35-37 and 105-117. It is interesting to note that floating decimal 
subroutines are described here, although a binary computer was being used; in 
other words, the numbers were represented as $10^e f$, not $2^e f$, and therefore the
scaling operations required multiplication or division by 10. On this particular 
machine such decimal scaling was almost as easy as shifting, and the decimal 
approach greatly simplified input/output conversions. 

Most published references to the details of floating point arithmetic rou- 
tines are scattered in technical memorandums distributed by various computer 
manufacturers, but there have been occasional appearances of these routines in 
the open literature. Besides the reference above, the following are of historical 
interest: R. H. Stark and D. B. MacMillan, Math. Comp. 5 (1951), 86-92, 
where a plugboard-wired program is described; D. McCracken, Digital Computer 
Programming (New York: Wiley, 1957), 121-131; J. W. Carr III, CACM 2, 5
(May 1959), 10-15; W. G. Wadey, JACM 7 (1960), 129-139; D. E. Knuth, JACM 
8 (1961), 119-128; O. Kesner, CACM 5 (1962), 269-271; F. P. Brooks and K. E. 
Iverson, Automatic Data Processing (New York: Wiley, 1963), 184-199. For a 
discussion of floating point arithmetic from a computer designer’s standpoint, see 
“Floating point operation” by S. G. Campbell, in Planning a Computer System, 
edited by W. Buchholz (New York: McGraw-Hill, 1962), 92-121; A. Padegs, 
IBM Systems J. 7 (1968), 22-29. Additional references, which deal primarily 
with the accuracy of floating point methods, are given in Section 4.2.2. 

A revolutionary change in floating point hardware took place when most 
manufacturers began to adopt ANSI/IEEE Standard 754 during the late 1980s. 
Relevant references are: IEEE Micro 4 (1984), 86-100; W. J. Cody, Comp. 
Sci. and Statistics: Symp. on the Interface 15 (1983), 133-139; W. M. Kahan, 
Mini/Micro West-83 Conf. Record (1983), Paper 16/1; D. Goldberg, Computing 
Surveys 23 (1991), 5-48, 413; W. J. Cody and J. T. Coonen, ACM Trans. Math. 
Software 19 (1993), 443-451.

The MMIX computer, which will replace MIX in the next edition of this book,
will naturally conform to the new standard.


EXERCISES 


1. [10] How would Avogadro’s number and Planck’s constant (3) be represented in 
base 100, excess 50, four-digit floating point notation? (This would be the representa- 
tion used by MIX, as in (4), when the byte size is 100.) 

2. [12] Assume that the exponent e is constrained to lie in the range 0 < e < E; 
what are the largest and smallest positive values that can be written as base b, excess q, 
p-digit floating point numbers? What are the largest and smallest positive values that 
can be written as normalized floating point numbers with these specifications? 

3. [11] (K. Zuse, 1936.) Show that if we are using normalized floating binary 
arithmetic, there is a way to increase the precision slightly without loss of memory 
space: A p-bit fraction part can be represented using only p — 1 bit positions of a 
computer word, if the range of exponent values is decreased very slightly. 

▶ 4. [16] Assume that b = 10, p = 8. What result does Algorithm A give for
(50, +.98765432) $\oplus$ (49, +.33333333)? For (53, −.99987654) $\oplus$ (54, +.10000000)? For
(45, −.50000001) $\oplus$ (54, +.10000000)?

▶ 5. [24] Let us say that $x \sim y$ (with respect to a given radix b) if x and y are real
numbers satisfying the following conditions:

$$\begin{aligned}
\lfloor x/b \rfloor &= \lfloor y/b \rfloor;\\
x \bmod b = 0 &\iff y \bmod b = 0;\\
0 < x \bmod b < \tfrac12 b &\iff 0 < y \bmod b < \tfrac12 b;\\
x \bmod b = \tfrac12 b &\iff y \bmod b = \tfrac12 b;\\
\tfrac12 b < x \bmod b < b &\iff \tfrac12 b < y \bmod b < b.
\end{aligned}$$

Prove that if $f_v$ is replaced by $b^{-p-2}F_v$ between steps A5 and A6 of Algorithm A, where
$F_v \sim b^{p+2} f_v$, the result of that algorithm will be unchanged. (If $F_v$ is an integer and b is
even, this operation essentially truncates $f_v$ to p+2 places while remembering whether
any nonzero digits have been dropped, thereby minimizing the length of register that
is needed for the addition in step A6.)


6. [20] If the result of a FADD instruction is zero, what will be the sign of rA, according
to the definitions of MIX’s floating point attachment given in this section? 

7. [27] Discuss floating point arithmetic using balanced ternary notation. 

8. [20] Give examples of normalized eight-digit floating decimal numbers u and v 
for which addition yields (a) exponent underflow, (b) exponent overflow, assuming that 
exponents must satisfy 0 < e < 100. 


9. [M24] (W. M. Kahan.) Assume that the occurrence of exponent underflow causes 
the result to be replaced by zero, with no error indication given. Using excess zero, 
eight-digit floating decimal numbers with e in the range —50 < e < 50, find positive 
values of a, b, c, d, and y such that (11) holds. 


10. [12] Give an example of normalized eight-digit floating decimal numbers u and v 
for which rounding overflow occurs in addition. 


▶ 11. [M20] Give an example of normalized, excess 50, eight-digit floating decimal
numbers u and v for which rounding overflow occurs in multiplication.

12. [M25] Prove that rounding overflow cannot occur during the normalization phase
of floating point division.

13. [30] When doing “interval arithmetic” we don’t want to round the results of a
floating point computation; we want rather to implement operations such as $\triangledown$ and $\vartriangle$,
which give the tightest possible representable bounds on the true sum:


$$u \mathbin{\triangledown} v \;\le\; u + v \;\le\; u \mathbin{\vartriangle} v.$$


How should the algorithms of this section be modified for such a purpose? 


14. [25] Write a MIX subroutine that begins with an arbitrary floating point number 
in register A, not necessarily normalized, and converts it to the nearest fixed point 
integer (or determines that the number is too large in absolute value to make such a 
conversion possible). 


15. [28] Write a MIX subroutine, to be used in connection with the other subroutines 
of this section, that calculates u ⓜⓞⓓ 1, namely $u - \lfloor u \rfloor$ rounded to the nearest floating
point number, given a floating point number u. Notice that when u is a very small
negative number, u ⓜⓞⓓ 1 should be rounded so that the result is unity (even though
u mod 1 has been defined to be always less than unity, as a real number).


16. [HM21] (Robert L. Smith.) Design an algorithm to compute the real and imagi- 
nary parts of the complex number (a+ bi)/(c+ di), given real floating point values a, b, 
c, and d with $c + di \ne 0$. Avoid the computation of $c^2 + d^2$, since it would cause floating
point overflow even when |c| or |d| is approximately the square root of the maximum 
allowable floating point value. 


17. [40] (John Cocke.) Explore the idea of extending the range of floating point 
numbers by defining a single-word representation in which the precision of the fraction 
decreases as the magnitude of the exponent increases. 


18. [25] Consider a binary computer with 36-bit words, on which positive floating
binary numbers are represented as $(0\,e_1 e_2 \ldots e_8\, f_1 f_2 \ldots f_{27})_2$; here $(e_1 e_2 \ldots e_8)_2$ is an
excess $(10000000)_2$ exponent and $(f_1 f_2 \ldots f_{27})_2$ is a 27-bit fraction. Negative floating
point numbers are represented by the two’s complement of the corresponding positive
representation (see Section 4.1). Thus, 1.5 is 201|600000000 in octal notation, while
−1.5 is 576|200000000; the octal representations of 1.0 and −1.0 are 201|400000000
and 576|400000000, respectively. (A vertical line is used here to show the boundary
between exponent and fraction.) Note that bit $f_1$ of a normalized positive number is
always 1, while it is almost always zero for negative numbers; the exceptional cases are
representations of $-2^k$.

Suppose that the exact result of a floating point operation has the octal code 
572|740000000|01; this (negative) 33-bit fraction must be normalized and rounded to 
27 bits. If we shift left until the leading fraction bit is zero, we get 576 |000000000 |20, 
but this rounds to the illegal value 576|000000000; we have over-normalized, since 
the correct answer is 575|400000000. On the other hand if we start (in some other 
problem) with the value 572|740000000|05 and stop before over-normalizing it, we get 
575 |400000000|50, which rounds to the unnormalized number 575|400000001; subse- 
quent normalization yields 576 |000000002 while the correct answer is 576|000000001. 

Give a simple, correct rounding rule that resolves this dilemma on such a machine 
(without abandoning two’s complement notation). 

19. [24] What is the running time for the FADD subroutine in Program A, in terms 
of relevant characteristics of the data? What is the maximum running time, over all 
inputs that do not cause exponent overflow or underflow?

Round numbers are always false.
— SAMUEL JOHNSON (1750) 


I shall speak in round numbers, not absolutely accurate,
yet not so wide from truth as to vary the result materially. 


— THOMAS JEFFERSON (1824) 


4.2.2. Accuracy of Floating Point Arithmetic 

Floating point computation is by nature inexact, and programmers can easily 
misuse it so that the computed answers consist almost entirely of “noise.” One 
of the principal problems of numerical analysis is to determine how accurate 
the results of certain numerical methods will be. There’s a credibility gap: We 
don’t know how much of the computer’s answers to believe. Novice computer 
users solve this problem by implicitly trusting in the computer as an infallible 
authority; they tend to believe that all digits of a printed answer are significant. 
Disillusioned computer users have just the opposite approach; they are constantly 
afraid that their answers are almost meaningless. Many serious mathematicians 
have attempted to analyze a sequence of floating point operations rigorously, 
but have found the task so formidable that they have tried to be content with 
plausibility arguments instead. 

A thorough examination of error analysis techniques is beyond the scope 
of this book, but in the present section we shall study some of the low-level 
characteristics of floating point arithmetic errors. Our goal is to discover how 
to perform floating point arithmetic in such a way that reasonable analyses of 
error propagation are facilitated as much as possible. 

A rough (but reasonably useful) way to express the behavior of floating 
point arithmetic can be based on the concept of “significant figures” or relative 
error. If we are representing an exact real number x inside a computer by
using the approximation $\hat{x} = x(1 + \epsilon)$, the quantity $\epsilon = (\hat{x} - x)/x$ is called the
relative error of approximation. Roughly speaking, the operations of floating
point multiplication and division do not magnify the relative error by very
much; but floating point subtraction of nearly equal quantities (and floating
point addition, $u \oplus v$, where u is nearly equal to −v) can very greatly increase
the relative error. So we have a general rule of thumb, that a substantial loss 
of accuracy is expected from such additions and subtractions, but not from 
multiplications and divisions. On the other hand, the situation is somewhat 
paradoxical and needs to be understood properly, since the “bad” additions and 
subtractions are always performed with perfect accuracy! (See exercise 25.) 

One of the consequences of the possible unreliability of floating point addition
is that the associative law breaks down:

$$(u \oplus v) \oplus w \ne u \oplus (v \oplus w), \qquad \text{for many } u, v, w. \tag{1}$$
For example,

$$\begin{aligned}
(11111113. \oplus -11111111.) \oplus 7.5111111 &= 2.0000000 \oplus 7.5111111 = 9.5111111;\\
11111113. \oplus (-11111111. \oplus 7.5111111) &= 11111113. \oplus -11111103. = 10.000000.
\end{aligned}$$

(All examples in this section are given in eight-digit floating decimal arithmetic,
with exponents indicated by an explicit decimal point. Recall that, as in Section
4.2.1, the symbols $\oplus$, $\ominus$, $\otimes$, $\oslash$ are used to stand for floating point operations
that correspond to the exact operations +, −, ×, /.)
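
Readers who wish to experiment can reproduce the example above with a few lines of Python. The sketch assumes that the standard `decimal` module, restricted to eight significant digits, is an adequate stand-in for the floating decimal arithmetic of this section.

```python
from decimal import Context, Decimal

CTX = Context(prec=8)    # eight-digit floating decimal arithmetic

u = Decimal("11111113")
v = Decimal("-11111111")
w = Decimal("7.5111111")

left = CTX.add(CTX.add(u, v), w)     # (u (+) v) (+) w
right = CTX.add(u, CTX.add(v, w))    # u (+) (v (+) w)
```

The inner sum v ⊕ w in the second expression is rounded to −11111103., discarding the fraction .4888889 entirely; that is where the two results part company.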

In view of the failure of the associative law, the comment of Mrs. La Touche 
that appears at the beginning of this chapter makes a good deal of sense with 
respect to floating point arithmetic. Mathematical notations like “$a_1 + a_2 + a_3$”
or “$\sum_{k=1}^{n} a_k$” are inherently based upon the assumption of associativity, so
a programmer must be especially careful not to assume implicitly that the 
associative law is valid. 


A. An axiomatic approach. Although the associative law is not valid, the 
commutative law


$$u \oplus v = v \oplus u \tag{2}$$


does hold, and this law can be a valuable conceptual asset in programming and
in the analysis of programs. Equation (2) suggests that we should look for
additional examples of important laws that are satisfied by $\oplus$, $\ominus$, $\otimes$, and $\oslash$;
it is not unreasonable to say that floating point routines should be designed to 
preserve as many of the ordinary mathematical laws as possible. If more axioms 
are valid, it becomes easier to write good programs, and programs also become 
more portable from machine to machine. 

Let us therefore consider some of the other basic laws that are valid for 
normalized floating point operations as described in the previous section. First 
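we have

\begin{align}
u \ominus v &= u \oplus -v; \tag{3}\\
-(u \oplus v) &= -u \oplus -v; \tag{4}\\
u \oplus v = 0 \quad &\text{if and only if} \quad v = -u; \tag{5}\\
u \oplus 0 &= u. \tag{6}
\end{align}
From these laws we can derive further identities; for example (exercise 1),
$$u \ominus v = -(v \ominus u). \tag{7}$$

Identities (2) to (6) are easily deduced from the algorithms in Section 4.2.1.
The following rule is slightly less obvious:

$$\text{if } u \le v \text{ then } u \oplus w \le v \oplus w. \tag{8}$$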


Instead of attempting to prove this rule by analyzing Algorithm 4.2.1A, let us go 
back to the basic principle by which that algorithm was designed. (Algorithmic 
proofs aren’t always easier than mathematical ones.) Our idea was that the 
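floating point operations should satisfy

$$\begin{aligned}
u \oplus v &= \mathrm{round}(u + v), & u \ominus v &= \mathrm{round}(u - v),\\
u \otimes v &= \mathrm{round}(u \times v), & u \oslash v &= \mathrm{round}(u / v),
\end{aligned} \tag{9}$$
where round(x) denotes the best floating point approximation to x as defined in
Algorithm 4.2.1N. We have

\begin{align}
\mathrm{round}(-x) &= -\mathrm{round}(x), \tag{10}\\
x \le y \quad &\text{implies} \quad \mathrm{round}(x) \le \mathrm{round}(y), \tag{11}
\end{align}
and these fundamental relations yield properties (2) through (8) immediately.
We can also write down several more identities:
$$\begin{gathered}
u \otimes v = v \otimes u, \qquad (-u) \otimes v = -(u \otimes v), \qquad 1 \otimes v = v;\\
u \otimes v = 0 \quad \text{if and only if} \quad u = 0 \text{ or } v = 0;\\
(-u) \oslash v = u \oslash (-v) = -(u \oslash v);\\
0 \oslash v = 0, \qquad u \oslash 1 = u, \qquad u \oslash u = 1.
\end{gathered}$$

If $u \le v$ and $w > 0$, then $u \otimes w \le v \otimes w$ and $u \oslash w \le v \oslash w$; also $w \oslash u \ge w \oslash v$
when $v \ge u > 0$. If $u \oplus v = u + v$, then $(u \oplus v) \ominus v = u$; and if $u \otimes v = u \times v \ne 0$,
then $(u \otimes v) \oslash v = u$. We see that a good deal of regularity is present in spite of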
the inexactness of the floating point operations, when things have been defined 
properly. 

Several familiar rules of algebra are still, of course, conspicuously absent 
from the collection of identities above. The associative law for floating point 
multiplication is not strictly true, as shown in exercise 3, and the distributive 
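law between $\otimes$ and $\oplus$ can fail rather badly: Let u = 20000.000, v = −6.0000000,
and w = 6.0000003; then

$$\begin{aligned}
(u \otimes v) \oplus (u \otimes w) &= -120000.00 \oplus 120000.01 = .010000000\\
u \otimes (v \oplus w) &= 20000.000 \otimes .00000030000000 = .0060000000
\end{aligned}$$
so
$$u \otimes (v \oplus w) \ne (u \otimes v) \oplus (u \otimes w). \tag{12}$$
On the other hand we do have $b \otimes (v \oplus w) = (b \otimes v) \oplus (b \otimes w)$, when b is the
floating point radix, since

$$\mathrm{round}(bx) = b\,\mathrm{round}(x). \tag{13}$$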


(Strictly speaking, the identities and inequalities we are considering in this 
section implicitly assume that exponent underflow and overflow do not occur. 
The function round(x) is undefined when |x| is too small or too large, and
equations such as (13) hold only when both sides are defined.) 
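
The failure of the distributive law, and its survival when the multiplier is the radix, can both be checked mechanically. The following Python sketch again assumes that the `decimal` module with eight significant digits models the arithmetic of this section.

```python
from decimal import Context, Decimal

CTX = Context(prec=8)    # eight-digit floating decimal arithmetic

u = Decimal("20000.000")
v = Decimal("-6.0000000")
w = Decimal("6.0000003")

lhs = CTX.add(CTX.multiply(u, v), CTX.multiply(u, w))   # (u (x) v) (+) (u (x) w)
rhs = CTX.multiply(u, CTX.add(v, w))                    # u (x) (v (+) w)

# With the radix b = 10 as multiplier the two sides agree, as (13) predicts.
ten = Decimal(10)
radix_lhs = CTX.add(CTX.multiply(ten, v), CTX.multiply(ten, w))
radix_rhs = CTX.multiply(ten, CTX.add(v, w))
```

The discrepancy arises because u ⊗ w must be rounded from 120000.006 to 120000.01 before the cancellation takes place.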

The failure of Cauchy’s fundamental inequality

$$(x_1^2 + \cdots + x_n^2)(y_1^2 + \cdots + y_n^2) \ge (x_1 y_1 + \cdots + x_n y_n)^2$$

is another important example of the breakdown of traditional algebra in the
presence of floating point arithmetic. Exercise 7 shows that Cauchy’s inequality
can fail even in the simple case n = 2, $x_1 = x_2 = 1$. Novice programmers who
calculate the standard deviation of some observations by using the textbook
formula

$$\sigma = \sqrt{\Bigl(n \sum_{1 \le k \le n} x_k^2 - \Bigl(\sum_{1 \le k \le n} x_k\Bigr)^{2}\Bigr) \Big/ n(n-1)} \tag{14}$$

often find themselves taking the square root of a negative number! A much better
way to calculate means and standard deviations with floating point arithmetic
is to use the recurrence formulas

\begin{align}
M_1 &= x_1, & M_k &= M_{k-1} \oplus (x_k \ominus M_{k-1}) \oslash k, \tag{15}\\
S_1 &= 0, & S_k &= S_{k-1} \oplus (x_k \ominus M_{k-1}) \otimes (x_k \ominus M_k), \tag{16}
\end{align}

for $2 \le k \le n$, where $\sigma = \sqrt{S_n/(n-1)}$. [See B. P. Welford, Technometrics 4
(1962), 419-420.] With this method $S_n$ can never be negative, and we avoid
other serious problems encountered by the naïve method of accumulating sums, 
as shown in exercise 16. (See exercise 19 for a summation technique that provides 
an even better guarantee on the accuracy.) 
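The recurrences (15) and (16) translate almost line for line into a program. Here is a minimal sketch in Python, assuming binary double precision floats and at least two observations; the function names are ours, and the second function is included only so that the quantity under the square root in (14) can be inspected for comparison.

```python
import math

def welford_std(xs):
    """Sample standard deviation via recurrences (15) and (16); needs len(xs) >= 2."""
    mean = xs[0]                      # M_1 = x_1
    s = 0.0                           # S_1 = 0
    for k, x in enumerate(xs[1:], start=2):
        delta = x - mean              # x_k (-) M_{k-1}
        mean = mean + delta / k       # M_k
        s = s + delta * (x - mean)    # S_k
    return math.sqrt(s / (len(xs) - 1))

def textbook_numerator(xs):
    """The quantity n*sum(x^2) - (sum x)^2 of formula (14), computed naively."""
    n = len(xs)
    return n * sum(x * x for x in xs) - sum(xs) ** 2

data = [1e8 + 1, 1e8 + 2, 1e8 + 3]
print(welford_std(data), textbook_numerator(data))
```

For this data the mean is large compared with the spread, which is the situation where the naive numerator loses most of its significant digits to cancellation, while every step of the recurrence happens to be exact.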

Although algebraic laws do not always hold exactly, we can often show that 
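they aren’t too far off base. When $b^{e-1} \le |x| < b^e$ we have $\operatorname{round}(x) = x + \rho(x)$,
where $|\rho(x)| \le \frac12 b^{e-p}$; hence

$$\operatorname{round}(x) = x\bigl(1 + \delta(x)\bigr), \tag{17}$$

where the relative error is bounded independently of $x$:

$$|\delta(x)| = \frac{|\rho(x)|}{|x|} \le \frac{|\rho(x)|}{b^{e-1} + |\rho(x)|} \le \frac{\frac12 b^{e-p}}{b^{e-1} + \frac12 b^{e-p}} < \tfrac12 b^{1-p}. \tag{18}$$

We can use this inequality to estimate the relative error of normalized floating
point calculations in a simple way, since $u \oplus v = (u + v)\bigl(1 + \delta(u + v)\bigr)$, etc.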
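Inequality (18) can be spot-checked with exact rational arithmetic. The sketch below assumes that Python floats are IEEE binary doubles ($b = 2$, $p = 53$) rounded to nearest, so that $\frac12 b^{1-p} = 2^{-53}$; the helper name and the sampling range are our own choices.

```python
from fractions import Fraction
import random

P = 53                          # assumed precision of a Python float, in bits
HALF_ULP = Fraction(1, 2**P)    # (1/2) b^(1-p) for b = 2

def relative_error_of_sum(u, v):
    exact = Fraction(u) + Fraction(v)   # u + v, with no rounding at all
    computed = Fraction(u + v)          # u (+) v, as delivered by the machine
    return abs(computed - exact) / abs(exact)

rng = random.Random(1)          # fixed seed, so the experiment is repeatable
worst = max(relative_error_of_sum(rng.uniform(1, 2), rng.uniform(1, 2))
            for _ in range(1000))
print(float(worst), float(HALF_ULP))
```

A random sample proves nothing, of course; it merely illustrates that the observed relative error stays below the bound.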

As an example of typical error-estimation procedures, let us consider the 
associative law for multiplication. Exercise 3 shows that $(u \otimes v) \otimes w$ is not in
general equal to $u \otimes (v \otimes w)$; but the situation in this case is much better than it
was with respect to the associative law of addition (1) and the distributive law
(12). In fact, we have

$$\begin{aligned}
(u \otimes v) \otimes w &= \bigl((uv)(1 + \delta_1)\bigr) \otimes w = uvw(1 + \delta_1)(1 + \delta_2),\\
u \otimes (v \otimes w) &= u \otimes \bigl((vw)(1 + \delta_3)\bigr) = uvw(1 + \delta_3)(1 + \delta_4),
\end{aligned}$$

for some $\delta_1$, $\delta_2$, $\delta_3$, $\delta_4$, provided that no exponent underflow or overflow occurs,
where $|\delta_j| < \frac12 b^{1-p}$ for each $j$. Hence

$$\frac{(u \otimes v) \otimes w}{u \otimes (v \otimes w)} = \frac{(1 + \delta_1)(1 + \delta_2)}{(1 + \delta_3)(1 + \delta_4)} = 1 + \delta,$$

where

$$|\delta| < 2b^{1-p} \big/ \bigl(1 - \tfrac12 b^{1-p}\bigr)^2. \tag{19}$$

The number $b^{1-p}$ occurs so often in such analyses, it has been given a special
name, one ulp, meaning one unit in the last place of the fraction part. Floating
point operations are correct to within half an ulp, and the calculation of $uvw$ by
two floating point multiplications will be correct within about one ulp (ignoring
second-order terms). Hence the associative law for multiplication holds to within
about two ulps of relative error.
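The two-ulp estimate (19) is easy to observe experimentally. This is a minimal sketch, again assuming IEEE binary doubles ($b = 2$, $p = 53$) and operands of moderate size so that no exponent overflow or underflow occurs; the names `EPS0`, `BOUND`, and `assoc_discrepancy` are ours.

```python
from fractions import Fraction
import random

EPS0 = Fraction(1, 2**52)                  # one ulp, b^(1-p) with b = 2, p = 53
BOUND = 2 * EPS0 / (1 - EPS0 / 2) ** 2     # right-hand side of (19)

def assoc_discrepancy(u, v, w):
    """|delta|, where ((u*v)*w) / (u*(v*w)) = 1 + delta, evaluated exactly."""
    left = Fraction((u * v) * w)
    right = Fraction(u * (v * w))
    return abs(left / right - 1)

rng = random.Random(2)
triples = [(rng.uniform(1, 10), rng.uniform(1, 10), rng.uniform(1, 10))
           for _ in range(1000)]
worst = max(assoc_discrepancy(*t) for t in triples)
unequal = sum(1 for (u, v, w) in triples if (u * v) * w != u * (v * w))
print(unequal, float(worst), float(BOUND))
```

The count `unequal` shows how often the associative law fails outright in the sample, while `worst` shows that each failure is small.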

We have shown that $(u \otimes v) \otimes w$ is approximately equal to $u \otimes (v \otimes w)$,
except when exponent overflow or underflow is a problem. It is worthwhile to 
study this intuitive idea of approximate equality in more detail; can we make 
such a statement more precise in a reasonable way? 

Programmers who use floating point arithmetic almost never want to test 
if two computed values are exactly equal to each other (or at least they hardly 
ever should try to do so), because this is an extremely improbable occurrence. 
For example, if a recurrence relation 


$$x_{n+1} = f(x_n)$$

is being used, where the theory in some textbook says that $x_n$ approaches a
limit as $n \to \infty$, it is usually a mistake to wait until $x_{n+1} = x_n$ for some $n$, since
the sequence $x_n$ might be periodic with a longer period due to the rounding of
intermediate results. The proper procedure is to wait until $|x_{n+1} - x_n| < \delta$, for
some suitably chosen number $\delta$; but since we don’t necessarily know the order
of magnitude of $x_n$ in advance, it is even better to wait until

$$|x_{n+1} - x_n| < \epsilon |x_n|; \tag{20}$$

now $\epsilon$ is a number that is much easier to select. Relation (20) is another
way of saying that $x_{n+1}$ and $x_n$ are approximately equal; and our discussion
indicates that a relation of “approximately equal” would be more useful than the 
traditional relation of equality, when floating point computations are involved, 
if we could only define a suitable approximation relation. 
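A stopping rule based on (20) might be sketched as follows. The function name, the default tolerance, and the iteration cap are our own assumptions; the recurrence used in the example is Newton's iteration for $\sqrt2$, chosen only because its limit is known.

```python
def iterate_until_stable(f, x0, eps=1e-12, max_steps=100):
    """Iterate x <- f(x) until relation (20) holds: |x_next - x| < eps * |x|."""
    x = x0
    for _ in range(max_steps):
        x_next = f(x)
        if abs(x_next - x) < eps * abs(x):
            return x_next
        x = x_next
    # Guard against sequences that cycle instead of settling down.
    raise RuntimeError("no convergence within max_steps")

root = iterate_until_stable(lambda x: (x + 2.0 / x) / 2.0, 1.0)
print(root)
```

Note that the test never asks whether two successive iterates are exactly equal, and that `eps` can be chosen without knowing the magnitude of the limit.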

In other words, the fact that strict equality of floating point values is of 
little importance implies that we ought to have a new operation, floating point 
comparison, which is intended to help assess the relative values of two floating 
point quantities. The following definitions seem to be appropriate for base b, 
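excess $q$, floating point numbers $u = (e_u, f_u)$ and $v = (e_v, f_v)$:

$$u \prec v \quad (\epsilon) \qquad \text{if and only if} \qquad v - u > \epsilon \max(b^{e_u - q}, b^{e_v - q}); \tag{21}$$
$$u \sim v \quad (\epsilon) \qquad \text{if and only if} \qquad |v - u| \le \epsilon \max(b^{e_u - q}, b^{e_v - q}); \tag{22}$$
$$u \succ v \quad (\epsilon) \qquad \text{if and only if} \qquad u - v > \epsilon \max(b^{e_u - q}, b^{e_v - q}); \tag{23}$$
$$u \approx v \quad (\epsilon) \qquad \text{if and only if} \qquad |v - u| \le \epsilon \min(b^{e_u - q}, b^{e_v - q}). \tag{24}$$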


These definitions apply to unnormalized values as well as to normalized ones.
Notice that exactly one of the conditions $u \prec v$ (definitely less than), $u \sim v$
(approximately equal to), or $u \succ v$ (definitely greater than) must always hold
for any given pair of values $u$ and $v$. The relation $u \approx v$ is somewhat stronger
than $u \sim v$, and it might be read “$u$ is essentially equal to $v$.” All of the relations
are specified in terms of a positive real number $\epsilon$ that measures the degree of
approximation being considered.

One way to view the definitions above is to associate a “neighborhood” set
$N(u) = \{x \mid |x - u| \le \epsilon b^{e_u - q}\}$ with each floating point number $u$; thus, $N(u)$
represents a set of values near $u$ based on the exponent of $u$’s floating point
representation. In these terms, we have $u \prec v$ if and only if $N(u) < v$ and $u < N(v)$;
$u \sim v$ if and only if $u \in N(v)$ or $v \in N(u)$; $u \succ v$ if and only if $u > N(v)$ and
$N(u) > v$; $u \approx v$ if and only if $u \in N(v)$ and $v \in N(u)$. (Here we are assuming
that the parameter $\epsilon$, which measures the degree of approximation, is a constant;
a more complete notation would indicate the dependence of $N(u)$ upon $\epsilon$.)
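Definitions (21)-(24) can be tried out on binary floats. The sketch below is only an approximation of the definitions, under two stated assumptions: the quantity $b^{e_u - q}$ is taken to be $2^e$ where `math.frexp(u)` returns `(f, e)` with $\frac12 \le |f| < 1$, and the difference $v - u$ is itself computed in floating point. All function names are ours.

```python
import math

def _scale(u):
    # For u = f * 2**e with 1/2 <= |f| < 1, frexp returns (f, e);
    # 2**e plays the role of b^(e_u - q). (frexp(0.0) gives e = 0.)
    return math.ldexp(1.0, math.frexp(u)[1])

def definitely_less(u, v, eps):          # relation (21)
    return v - u > eps * max(_scale(u), _scale(v))

def approximately_equal(u, v, eps):      # relation (22)
    return abs(v - u) <= eps * max(_scale(u), _scale(v))

def definitely_greater(u, v, eps):       # relation (23)
    return u - v > eps * max(_scale(u), _scale(v))

def essentially_equal(u, v, eps):        # relation (24)
    return abs(v - u) <= eps * min(_scale(u), _scale(v))

# 1.0 and 0.9999 have different exponents, so max and min differ by a factor of 2.
print(approximately_equal(1.0, 0.9999, 7.5e-5), essentially_equal(1.0, 0.9999, 7.5e-5))
```

The final line exhibits a pair that is approximately equal but not essentially equal for the chosen $\epsilon$, which is the sense in which (24) is the stronger relation.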

Here are some simple consequences of definitions (21)—(24): 


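if $u \prec v$ $(\epsilon)$ then $v \succ u$ $(\epsilon)$; (25)
if $u \approx v$ $(\epsilon)$ then $u \sim v$ $(\epsilon)$; (26)
$u \approx u$ $(\epsilon)$; (27)
if $u \prec v$ $(\epsilon)$ then $u < v$; (28)
if $u \prec v$ $(\epsilon_1)$ and $\epsilon_1 \ge \epsilon_2$ then $u \prec v$ $(\epsilon_2)$; (29)
if $u \sim v$ $(\epsilon_1)$ and $\epsilon_1 \le \epsilon_2$ then $u \sim v$ $(\epsilon_2)$; (30)
if $u \approx v$ $(\epsilon_1)$ and $\epsilon_1 \le \epsilon_2$ then $u \approx v$ $(\epsilon_2)$; (31)
if $u \prec v$ $(\epsilon_1)$ and $v \prec w$ $(\epsilon_2)$ then $u \prec w$ $(\min(\epsilon_1, \epsilon_2))$; (32)
if $u \approx v$ $(\epsilon_1)$ and $v \approx w$ $(\epsilon_2)$ then $u \sim w$ $(\epsilon_1 + \epsilon_2)$. (33)

Moreover, we can prove without difficulty that

$|u - v| \le \epsilon|u|$ and $|u - v| \le \epsilon|v|$ implies $u \approx v$ $(\epsilon)$; (34)
$|u - v| \le \epsilon|u|$ or $|u - v| \le \epsilon|v|$ implies $u \sim v$ $(\epsilon)$; (35)

and conversely, for normalized floating point numbers $u$ and $v$, when $\epsilon < 1$,

$u \approx v$ $(\epsilon)$ implies $|u - v| \le b\epsilon|u|$ and $|u - v| \le b\epsilon|v|$; (36)
$u \sim v$ $(\epsilon)$ implies $|u - v| \le b\epsilon|u|$ or $|u - v| \le b\epsilon|v|$. (37)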
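Let $\epsilon_0 = b^{1-p}$ be one ulp. The derivation of (17) establishes the inequality
$|x - \operatorname{round}(x)| = |\rho(x)| \le \frac12\epsilon_0 \min(|x|, |\operatorname{round}(x)|)$, hence

$$x \approx \operatorname{round}(x) \quad (\tfrac12\epsilon_0); \tag{38}$$

it follows that $u \oplus v \approx u + v$ $(\frac12\epsilon_0)$, etc. The approximate associative law for
multiplication derived above can be recast as follows: We have

$$|(u \otimes v) \otimes w - u \otimes (v \otimes w)| < \frac{2\epsilon_0}{(1 - \frac12\epsilon_0)^2}\,|u \otimes (v \otimes w)|$$

by (19), and the same inequality is valid with $(u \otimes v) \otimes w$ and $u \otimes (v \otimes w)$
interchanged. Hence by (34),

$$(u \otimes v) \otimes w \approx u \otimes (v \otimes w) \quad (\epsilon) \tag{39}$$

whenever $\epsilon \ge 2\epsilon_0 / (1 - \frac12\epsilon_0)^2$. For example, if $b = 10$ and $p = 8$ we may take
$\epsilon = 0.00000021$.

The relations $\prec$, $\sim$, $\succ$, and $\approx$ are useful within numerical algorithms, and it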
is therefore a good idea to provide routines for comparing floating point numbers 
as well as for doing arithmetic on them. 


Let us now shift our attention back to the question of finding exact relations 
that are satisfied by the floating point operations. It is interesting to note that 
floating point addition and subtraction are not completely intractable from an 
axiomatic standpoint, since they do satisfy the nontrivial identities stated in the 
following theorems. 


Theorem A. Let u and v be normalized floating point numbers. Then 


$$\bigl((u \oplus v) \ominus u\bigr) + \bigl((u \oplus v) \ominus ((u \oplus v) \ominus u)\bigr) = u \oplus v, \tag{40}$$
provided that no exponent overflow or underflow occurs. 


This rather cumbersome-looking identity can be rewritten in a simpler manner: 
Let 


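$$\begin{aligned}
u' &= (u \oplus v) \ominus v, & v' &= (u \oplus v) \ominus u;\\
u'' &= (u \oplus v) \ominus v', & v'' &= (u \oplus v) \ominus u'.
\end{aligned} \tag{41}$$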


Intuitively, $u'$ and $u''$ should be approximations to $u$, and $v'$ and $v''$ should be
approximations to $v$. Theorem A tells us that

$$u \oplus v = u' + v'' = u'' + v'. \tag{42}$$

This is a stronger statement than the identity

$$u \oplus v = u' \oplus v'' = u'' \oplus v', \tag{43}$$

which follows by rounding (42).
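Identity (42) can be tested mechanically, since the ordinary sums on its right-hand side can be formed exactly with rational arithmetic. The sketch assumes IEEE binary doubles rounded to nearest, with operands far from the overflow threshold; the function name and the sampling ranges are ours.

```python
from fractions import Fraction
import random

def theorem_a_parts(u, v):
    w = u + v        # u (+) v
    u1 = w - v       # u'
    v1 = w - u       # v'
    u2 = w - v1      # u''
    v2 = w - u1      # v''
    return w, u1, v1, u2, v2

rng = random.Random(3)
ok = True
for _ in range(1000):
    u = rng.uniform(-1e6, 1e6)
    v = rng.uniform(-1.0, 1.0)
    w, u1, v1, u2, v2 = theorem_a_parts(u, v)
    # Fractions add without rounding, so this tests (42) exactly.
    ok = ok and (Fraction(w) == Fraction(u1) + Fraction(v2) == Fraction(u2) + Fraction(v1))
print(ok)
```

The operands are deliberately of different magnitudes, so that $u \oplus v$ usually differs from $u + v$ and the check is not vacuous.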
Proof. Let us say that $t$ is a tail of $x$ modulo $b^e$ if

$$t \equiv x \pmod{b^e}, \qquad |t| \le \tfrac12 b^e; \tag{44}$$

thus, $x - \operatorname{round}(x)$ is always a tail of $x$. The proof of Theorem A rests largely
on the following simple fact proved in exercise 11:

Lemma T. If $t$ is a tail of the floating point number $x$, then $x \ominus t = x - t$. ∎


Let $w = u \oplus v$. Theorem A holds trivially when $w = 0$. By multiplying all
variables by a suitable power of $b$, we may assume without loss of generality that
$e_w = p$. Then $u + v = w + r$, where $r$ is a tail of $u + v$ modulo 1. Furthermore
$u' = \operatorname{round}(w - v) = \operatorname{round}(u - r) = u - r - t$, where $t$ is a tail of $u - r$ modulo $b^e$
and $e = e_{u'} - p$.

If $e \le 0$, then $t \equiv u - r \equiv -v$ (modulo $b^e$), hence $t$ is a tail of $-v$ and
$v'' = \operatorname{round}(w - u') = \operatorname{round}(v + t) = v + t$; this proves (40). If $e > 0$, then
$|u - r| \ge b^p - \frac12$; and since $|r| \le \frac12$, we have $|u| \ge b^p - 1$. It follows that $u$ is
an integer, so $r$ is a tail of $v$ modulo 1. If $u' = u$, then $t = -r$ is a tail of $-v$.
Otherwise the relation $\operatorname{round}(u - r) \ne u$ implies that $|u| = b^p - 1$, $|r| = \frac12$,
$|u'| = b^p$, $t = r$; again $t$ is a tail of $-v$. ∎

Theorem A exhibits a regularity property of floating point addition, but it
doesn’t seem to be an especially useful result. The following identity is more
significant:


Theorem B. Under the hypotheses of Theorem A and (41),

$$u + v = (u \oplus v) + \bigl((u \ominus u') \oplus (v \ominus v'')\bigr). \tag{45}$$

Proof. In fact, we can show that $u \ominus u' = u - u'$, $v \ominus v'' = v - v''$, and
$(u - u') \oplus (v - v'') = (u - u') + (v - v'')$, hence (45) will follow from Theorem A.
Using the notation of the preceding proof, these relations are respectively
equivalent to

$$\operatorname{round}(t + r) = t + r, \qquad \operatorname{round}(t) = t, \qquad \operatorname{round}(r) = r. \tag{46}$$

Exercise 12 establishes the theorem in the special case $|e_u - e_v| \ge p$. Otherwise
$u + v$ has at most $2p$ significant digits and it is easy to see that $\operatorname{round}(r) = r$. If
now $e > 0$, the proof of Theorem A shows that $t = -r$ or $t = r = \pm\frac12$. If $e \le 0$
we have $t + r \equiv u$ and $t \equiv -v$ (modulo $b^e$); this is enough to prove that $t + r$
and $t$ round to themselves, provided that $e_u \ge e$ and $e_v \ge e$. But either $e_u < 0$
or $e_v < 0$ would contradict our hypothesis that $|e_u - e_v| < p$, since $e_w = p$. ∎
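Formula (45) is directly usable as a program that returns both the rounded sum and its rounding error. The following is a minimal sketch for IEEE binary doubles, assuming no overflow; the function name `two_sum` is our own label, not one used in the text.

```python
from fractions import Fraction
import random

def two_sum(u, v):
    """Return (s, err) with s = u (+) v and, by (45), s + err = u + v exactly."""
    s = u + v
    u1 = s - v                    # u'
    v2 = s - u1                   # v''
    err = (u - u1) + (v - v2)     # (u (-) u') (+) (v (-) v'')
    return s, err

s, err = two_sum(1.0, 2.0 ** -60)
print(s, err)

rng = random.Random(4)
pairs = [(rng.uniform(-1e6, 1e6), rng.uniform(-1.0, 1.0)) for _ in range(1000)]
exact = all(Fraction(u) + Fraction(v) == sum(map(Fraction, two_sum(u, v)))
            for u, v in pairs)
print(exact)
```

In the first example the small operand vanishes entirely from the rounded sum, and it reappears intact as the correction term.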


Theorem B gives an explicit formula for the difference between $u + v$ and
$u \oplus v$, in terms of quantities that can be calculated directly using five operations
of floating point arithmetic. If the radix $b$ is 2 or 3, we can improve on this
result, obtaining the exact value of the correction term with only two floating
point operations and one (fixed point) comparison of absolute values:

Theorem C. If $b \le 3$ and $|u| \ge |v|$, then

$$u + v = (u \oplus v) + \bigl((u \ominus (u \oplus v)) \oplus v\bigr). \tag{47}$$

Proof. Following the conventions of preceding proofs again, we wish to show
that $v \ominus v' = r$. It suffices to show that $v' = w - u$, because (46) will then yield
$v \ominus v' = \operatorname{round}(v - v') = \operatorname{round}(u + v - w) = \operatorname{round}(r) = r$.

We shall in fact prove (47) whenever $b \le 3$ and $e_u \ge e_v$. If $e_u \ge p$, then $r$
is a tail of $v$ modulo 1, hence $v' = w \ominus u = v \ominus r = v - r = w - u$ as desired.
If $e_u < p$, then we must have $e_u = p - 1$, and $w - u$ is a multiple of $b^{-1}$; it will
therefore round to itself if its magnitude is less than $b^{p-1} + b^{-1}$. Since $b \le 3$, we
have indeed $|w - u| \le |w - u - v| + |v| \le \frac12 + (b^{p-1} - b^{-1}) < b^{p-1} + b^{-1}$. This
completes the proof. ∎
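For binary arithmetic, (47) gives a shorter program than (45). Here is a minimal sketch for IEEE binary doubles ($b = 2$), assuming no overflow; the function name `fast_two_sum` is our own label.

```python
from fractions import Fraction
import random

def fast_two_sum(u, v):
    """Return (s, err) with s + err = u + v exactly, using formula (47)."""
    if abs(u) < abs(v):
        u, v = v, u            # the one comparison of absolute values
    s = u + v                  # u (+) v
    err = (u - s) + v          # (u (-) (u (+) v)) (+) v
    return s, err

print(fast_two_sum(2.0 ** -60, 1.0))

rng = random.Random(5)
pairs = [(rng.uniform(-1.0, 1.0), rng.uniform(-1e6, 1e6)) for _ in range(1000)]
exact = all(Fraction(u) + Fraction(v) == sum(map(Fraction, fast_two_sum(u, v)))
            for u, v in pairs)
print(exact)
```

Whether the saved operations outweigh the cost of the comparison and the conditional swap depends on the machine at hand.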


The proofs of Theorems A, B, and C do not rely on the precise definitions of 
round(x) in the ambiguous cases when x is exactly midway between consecutive
floating point numbers; any way of resolving the ambiguity will suffice for the 
validity of everything we have proved so far. 

No rounding rule can be best for every application. For example, we gener- 
ally want a special rule when computing our income tax. But for most numerical 
calculations the best policy appears to be the rounding scheme specified in 
Algorithm 4.2.1N, which insists that the least significant digit should always 
be made even (or always odd) when an ambiguous value is rounded. This is not
a trivial technicality, of interest only to nit-pickers; it is an important practical
consideration, since the ambiguous case arises surprisingly often and a biased
rounding rule produces significantly poor results. For example, consider decimal
arithmetic and assume that remainders of 5 are always rounded upwards. Then if
$u = 1.0000000$ and $v = 0.55555555$ we have $u \oplus v = 1.5555556$; and if we
floating-subtract $v$ from this result we get $u' = 1.0000001$. Adding and subtracting $v$
from $u'$ gives 1.0000002, and the next time we get 1.0000003, etc.; the result
keeps growing although we are adding and subtracting the same value.
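The growth just described can be reproduced directly. The sketch below assumes that Python's `decimal` module with eight digits models the arithmetic of the example, with `ROUND_HALF_UP` playing the biased rule and `ROUND_HALF_EVEN` the rule of Algorithm 4.2.1N; the function name is ours.

```python
from decimal import Decimal, Context, ROUND_HALF_UP, ROUND_HALF_EVEN

def add_then_subtract(u, v, rounding, times):
    """Repeatedly replace u by (u (+) v) (-) v in 8-digit decimal arithmetic."""
    ctx = Context(prec=8, rounding=rounding)
    history = []
    for _ in range(times):
        u = ctx.subtract(ctx.add(u, v), v)
        history.append(u)
    return history

u = Decimal("1.0000000")
v = Decimal("0.55555555")
biased = add_then_subtract(u, v, ROUND_HALF_UP, 3)
stable = add_then_subtract(u, v, ROUND_HALF_EVEN, 3)
print(biased)
print(stable)
```

With the biased rule the value creeps upward by one unit in the last place on every round trip; with round-to-even it returns to its starting point.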

This phenomenon, called drift, will not occur when we use a stable rounding 
rule based on the parity of the least significant digit. More precisely: 


Theorem D. $(((u \oplus v) \ominus v) \oplus v) \ominus v = (u \oplus v) \ominus v$.

For example, if $u = 1.2345679$ and $v = -0.23456785$, we find

$$\begin{aligned}
u \oplus v &= 1.0000000, & (u \oplus v) \ominus v &= 1.2345678,\\
((u \oplus v) \ominus v) \oplus v &= 0.99999995, & (((u \oplus v) \ominus v) \oplus v) \ominus v &= 1.2345678.
\end{aligned}$$

The proof for general $u$ and $v$ seems to require a case analysis even more detailed
than that in the theorems above; see the references below. ∎
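The numerical example accompanying Theorem D can be confirmed with a few lines of code. As before, this sketch assumes that an eight-digit `decimal` context with round-half-even stands in for the arithmetic of the text; the names `a` through `d` are ours.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

ctx = Context(prec=8, rounding=ROUND_HALF_EVEN)
u = Decimal("1.2345679")
v = Decimal("-0.23456785")

a = ctx.add(u, v)         # u (+) v; the exact sum 1.00000005 is a tie
b = ctx.subtract(a, v)    # (u (+) v) (-) v; 1.23456785 is again a tie
c = ctx.add(b, v)         # ((u (+) v) (-) v) (+) v, exact
d = ctx.subtract(c, v)    # one more round trip, exact

print(a, b, c, d)
```

Two of the four operations hit an ambiguous case, and after the first round trip the value no longer changes.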


Theorem D is valid both for “round to even” and “round to odd”; how should 
we choose between these possibilities? When the radix b is odd, ambiguous cases 
never arise except during floating point division, and the rounding in such cases 
is comparatively unimportant. For even radices, there is reason to prefer the 
following rule: “Round to even when b/2 is odd, round to odd when b/2 is 
even.” The least significant digit of a floating point fraction occurs frequently 
as a remainder to be rounded off in subsequent calculations, and this rule avoids 
generating the digit b/2 in the least significant position whenever possible; its 
effect is to provide some memory of an ambiguous rounding so that subsequent 
rounding will tend to be unambiguous. For example, if we were to round to 
odd in the decimal system, repeated rounding of the number 2.44445 to one less 
place each time leads to the sequence 2.4445, 2.445, 2.45, 2.5, 3; if we round to 
even, such situations do not occur, although repeated rounding of a number like 
2.5454 will lead to almost as much error. [See Roy A. Keir, Inf. Proc. Letters 
3 (1975), 188-189.] Some people prefer rounding to even in all cases, so that 
the least significant digit will tend to be 0 more often. Exercise 23 demonstrates 
this advantage of round-to-even. Neither alternative conclusively dominates the 
other; fortunately the base is usually b = 2 or b = 10, when everyone agrees that 
round-to-even is best. 

A reader who has checked some of the details of the proofs above will realize 
the immense simplification that has been afforded by the simple rule $u \oplus v =
\operatorname{round}(u + v)$. If our floating point addition routine would fail to give this result
even in a few rare cases, the proofs would become enormously more complicated 
and perhaps they would even break down completely. 

Theorem B fails if truncation arithmetic is used in place of rounding, that
is, if we let $u \oplus v = \operatorname{trunc}(u + v)$ and $u \ominus v = \operatorname{trunc}(u - v)$, where $\operatorname{trunc}(x)$ for a
positive real $x$ is the largest floating point number $\le x$. An exception to
Theorem B would then occur for cases such as $(20, +.10000001) \oplus (10, -.10000001) =
(20, +.10000000)$, when the difference between $u + v$ and $u \oplus v$ cannot be expressed
exactly as a floating point number; and also for cases such as $12345678 \oplus
.012345678$, when it can be.

Many people feel that, since floating point arithmetic is inexact by nature, 
there is no harm in making it just a little bit less exact in certain rather rare cases, 
if it is convenient to do so. This policy saves a few cents in the design of computer 
hardware, or a small percentage of the average running time of a subroutine. But 
our discussion shows that such a policy is mistaken. We could save about five 
percent of the running time of the FADD subroutine, Program 4.2.1A, and about 
25 percent of its space, if we took the liberty of rounding incorrectly in a few 
cases, but we are much better off leaving it as it is. The reason is not to glorify 
“bit chasing”; a more fundamental issue is at stake here: Numerical subroutines 
should deliver results that satisfy simple, useful mathematical laws whenever 
possible. The crucial formula $u \oplus v = \operatorname{round}(u + v)$ is a regularity property
that makes a great deal of difference between whether mathematical analysis 
of computational algorithms is worth doing or worth avoiding. Without any 
underlying symmetry properties, the job of proving interesting results becomes 
extremely unpleasant. The enjoyment of one’s tools is an essential ingredient of 
successful work. 


B. Unnormalized floating point arithmetic. The policy of normalizing all 
floating point numbers may be construed in two ways: We may look on it favor- 
ably by saying that it is an attempt to get the maximum possible accuracy ob- 
tainable with a given degree of precision, or we may consider it to be potentially 
dangerous since it tends to imply that the results are more accurate than they 
really are. When we normalize the result of $(1, +.31428571) \ominus (1, +.31415927)$
to $(-2, +.12644000)$, we are suppressing information about the possibly greater
inaccuracy of the latter quantity. Such information would be retained if the 
answer were left as (1, +.00012644). 

The input data to a problem is frequently not known as precisely as the 
floating point representation allows. For example, the values of Avogadro’s 
number and Planck’s constant are not known to eight significant digits, and 
it might be more appropriate to denote them, respectively, by 


$$(27, +.00060221) \quad\text{and}\quad (-23, +.00066261)$$

instead of by $(24, +.60221400)$ and $(-26, +.66261000)$. It would be nice if
we could give our input data for each problem in an unnormalized form that 
expresses how much precision is assumed, and if the output would indicate just 
how much precision is known in the answer. Unfortunately, this is a terribly 
difficult problem, although the use of unnormalized arithmetic can help to give 
some indication. For example, we can say with a fair degree of certainty that the 
product of Avogadro’s number by Planck’s constant is (1, +.00039903), and that 
their sum is (27, +.00060221). (The purpose of this example is not to suggest that 
any important physical significance should be attached to the sum and product
of these fundamental constants; the point is that it is possible to preserve a little 
of the information about precision in the result of calculations with imprecise 
quantities, when the original operands are independent of each other.) 

The rules for unnormalized arithmetic are simply this: Let $l_u$ be the number
of leading zeros in the fraction part of $u = (e_u, f_u)$, so that $l_u$ is the largest integer
$\le p$ with $|f_u| < b^{-l_u}$. Then addition and subtraction are performed just as in
Algorithm 4.2.1A, except that all scaling to the left is suppressed. Multiplication
and division are performed as in Algorithm 4.2.1M, except that the answer is
scaled right or left so that precisely $\max(l_u, l_v)$ leading zeros appear. Essentially
the same rules have been used in manual calculation for many years. 

It follows that, for unnormalized computations, 


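$$e_{u \oplus v},\; e_{u \ominus v} = \max(e_u, e_v) + (0 \text{ or } 1) \tag{48}$$
$$e_{u \otimes v} = e_u + e_v - q - \min(l_u, l_v) - (0 \text{ or } 1) \tag{49}$$
$$e_{u \oslash v} = e_u - e_v + q - l_u + l_v + \max(l_u, l_v) + (0 \text{ or } 1). \tag{50}$$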


When the result of a calculation is zero, an unnormalized zero (often called an 
“order of magnitude zero”) is given as the answer; this indicates that the answer 
may not truly be zero, we just don’t know any of its significant digits. 

Error analysis takes a somewhat different form with unnormalized floating 
point arithmetic. Let us define 


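$$\delta_u = \tfrac12 b^{e_u - q - p} \quad \text{if } u = (e_u, f_u). \tag{51}$$

This quantity depends on the representation of $u$, not just on the value $b^{e_u - q} f_u$.
Our rounding rule tells us that

$$\begin{aligned}
|u \oplus v - (u + v)| &\le \delta_{u \oplus v}, & |u \ominus v - (u - v)| &\le \delta_{u \ominus v},\\
|u \otimes v - (u \times v)| &\le \delta_{u \otimes v}, & |u \oslash v - (u / v)| &\le \delta_{u \oslash v}.
\end{aligned}$$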


These inequalities apply to normalized as well as unnormalized arithmetic; the 
main difference between the two types of error analysis is the definition of the 
exponent of the result of each operation (Eqs. (48) to (50)). 

We have remarked that the relations $\prec$, $\sim$, $\succ$, and $\approx$ defined earlier in
this section are valid and meaningful for unnormalized numbers as well as for
normalized numbers. As an example of the use of these relations, let us prove
an approximate associative law for unnormalized addition, analogous to (39):

$$(u \oplus v) \oplus w \approx u \oplus (v \oplus w) \quad (\epsilon), \tag{52}$$

for suitable $\epsilon$. We have

$$\begin{aligned}
|(u \oplus v) \oplus w - (u + v + w)| &\le |(u \oplus v) \oplus w - ((u \oplus v) + w)| + |u \oplus v - (u + v)|\\
&\le \delta_{(u \oplus v) \oplus w} + \delta_{u \oplus v}\\
&\le 2\delta_{(u \oplus v) \oplus w}.
\end{aligned}$$

A similar formula holds for $|u \oplus (v \oplus w) - (u + v + w)|$. Now since $e_{(u \oplus v) \oplus w} =
\max(e_u, e_v, e_w) + (0, 1, \text{ or } 2)$, we have $\delta_{(u \oplus v) \oplus w} \le b^2 \delta_{u \oplus (v \oplus w)}$. Therefore we find
that (52) is valid when $\epsilon \ge b^{2-p} + b^{-p}$; unnormalized addition is not as erratic
as normalized addition with respect to the associative law.

It should be emphasized that unnormalized arithmetic is by no means a 
panacea. There are examples where it indicates greater accuracy than is present 
(for example, addition of a great many small quantities of about the same
magnitude, or evaluation of $x^n$ for large $n$); and there are many more examples when it
indicates poor accuracy while normalized arithmetic actually does produce good 
results. There is an important reason why no straightforward one-operation-at- 
a-time method of error analysis can be completely satisfactory, namely the fact 
that operands are usually not independent of each other. This means that errors 
tend to cancel or reinforce each other in strange ways. For example, suppose that 
$x$ is approximately 1/2, and suppose that we have an approximation $y = x + \delta$
with absolute error $\delta$. If we now wish to compute $x(1 - x)$, we can form $y(1 - y)$;
if $x = \frac12 + \epsilon$ we find $y(1 - y) = x(1 - x) - 2\epsilon\delta - \delta^2$, so the absolute error has
decreased substantially: It has been multiplied by a factor of $2\epsilon + \delta$. This is
just one case where multiplication of imprecise quantities can lead to a quite
accurate result when the operands are not independent of each other. A more
obvious example is the computation of $x \ominus x$, which can be obtained with perfect
accuracy regardless of how bad an approximation to $x$ we begin with.
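The cancellation in $y(1 - y)$ can be checked with exact rational arithmetic, so that no rounding obscures the algebra. This is a small sketch with sample values of $\epsilon$ and $\delta$ chosen by us purely for illustration.

```python
from fractions import Fraction

def error_in_product(eps, delta):
    """Exact value of y(1 - y) - x(1 - x) for x = 1/2 + eps, y = x + delta."""
    x = Fraction(1, 2) + eps
    y = x + delta
    return y * (1 - y) - x * (1 - x)

eps = Fraction(1, 1000)     # x is within 0.001 of 1/2
delta = Fraction(1, 100)    # y carries an absolute error of 0.01
err = error_in_product(eps, delta)
print(err, abs(err) / delta)
```

An input error of 0.01 leaves an error of only 0.00012 in the result; the ratio printed at the end is the factor $2\epsilon + \delta$.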

The extra information that unnormalized arithmetic gives us can often be 
more important than the information it destroys during an extended calcula- 
tion, but (as usual) we must use it with care. Examples of the proper use of 
unnormalized arithmetic are discussed by R. L. Ashenhurst and N. Metropolis 
in Computers and Computing, AMM, Slaught Memorial Papers 10 (February 
1965), 47-59; by N. Metropolis in Numer. Math. 7 (1965), 104-112; and by 
R. L. Ashenhurst in Error in Digital Computation 2, edited by L. B. Rall 
(New York: Wiley, 1965), 3-37. Appropriate methods for computing standard 
mathematical functions with both input and output in unnormalized form are 
given by R. L. Ashenhurst in JACM 11 (1964), 168-187. An extension of 
unnormalized arithmetic, which remembers that certain values are known to 
be exact, has been discussed by N. Metropolis in IEEE Trans. C-22 (1973), 
573-576. 


C. Interval arithmetic. Another approach to the problem of error determi- 
nation is the so-called interval or range arithmetic, in which rigorous upper and 
lower bounds on each number are maintained during the calculations. Thus, for 
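example, if we know that $u_0 \le u \le u_1$ and $v_0 \le v \le v_1$, we represent this by the
interval notation $u = [u_0 \,..\, u_1]$, $v = [v_0 \,..\, v_1]$. The sum $u \oplus v$ is
$[u_0 \overset{\triangledown}{\oplus} v_0 \,..\, u_1 \overset{\vartriangle}{\oplus} v_1]$,
where $\overset{\triangledown}{\oplus}$ denotes “lower floating point addition,” the greatest representable
number less than or equal to the true sum, and $\overset{\vartriangle}{\oplus}$ is defined similarly (see
exercise 4.2.1-13). Furthermore $u \ominus v = [u_0 \overset{\triangledown}{\ominus} v_1 \,..\, u_1 \overset{\vartriangle}{\ominus} v_0]$; and if $u_0$ and $v_0$ are
positive, we have $u \otimes v = [u_0 \overset{\triangledown}{\otimes} v_0 \,..\, u_1 \overset{\vartriangle}{\otimes} v_1]$,
$u \oslash v = [u_0 \overset{\triangledown}{\oslash} v_1 \,..\, u_1 \overset{\vartriangle}{\oslash} v_0]$. For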
example, we might represent Avogadro’s number and Planck’s constant as 


$$\begin{aligned}
N &= [(24, +.60221331) \,..\, (24, +.60221403)],\\
h &= [(-26, +.66260715) \,..\, (-26, +.66260795)];
\end{aligned}$$

their sum and product would then turn out to be

$$\begin{aligned}
N \oplus h &= [(24, +.60221331) \,..\, (24, +.60221404)],\\
N \otimes h &= [(-2, +.39903084) \,..\, (-2, +.39903181)].
\end{aligned}$$


If we try to divide by $[v_0 \,..\, v_1]$ when $v_0 < 0 < v_1$, there is a possibility of
division by zero. Since the philosophy underlying interval arithmetic is to provide 
rigorous error estimates, a divide-by-zero error should be signalled in this case. 
However, overflow and underflow need not be treated as fatal errors in interval 
arithmetic, if special conventions are introduced as discussed in exercise 24. 
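The interval sum and product of $N$ and $h$ shown above can be recomputed with directed rounding. The sketch below assumes that eight-digit `decimal` contexts with `ROUND_FLOOR` and `ROUND_CEILING` model the lower and upper floating point operations; the function names are ours, and the multiplication routine handles only the case of positive endpoints treated in the text.

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

DOWN = Context(prec=8, rounding=ROUND_FLOOR)     # "lower" operations
UP = Context(prec=8, rounding=ROUND_CEILING)     # "upper" operations

def interval_add(u, v):
    return (DOWN.add(u[0], v[0]), UP.add(u[1], v[1]))

def interval_mul_positive(u, v):
    # Valid only when the lower endpoints u[0] and v[0] are positive.
    return (DOWN.multiply(u[0], v[0]), UP.multiply(u[1], v[1]))

# (24, +.60221331) means .60221331 * 10^24, and so on (excess 0).
N = (Decimal("6.0221331E23"), Decimal("6.0221403E23"))
h = (Decimal("6.6260715E-27"), Decimal("6.6260795E-27"))

print(interval_add(N, h))
print(interval_mul_positive(N, h))
```

Adding $h$ to $N$ changes nothing at the lower endpoint but pushes the upper endpoint up by one unit in the last place, since the upper operation must never return less than the true sum.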

Interval arithmetic takes only about twice as long as ordinary arithmetic, 
and it provides truly reliable error estimates. Considering the difficulty of
mathematical error analyses, this is indeed a small price to pay. Since the 
intermediate values in a calculation often depend on each other, as explained 
above, the final estimates obtained with interval arithmetic will tend to be 
pessimistic; and iterative numerical methods often have to be redesigned if we 
want to deal with intervals. However, the prospects for effective use of interval 
arithmetic look very good, so efforts should be made to increase its availability 
and to make it as user-friendly as possible. 


D. History and bibliography. Jules Tannery’s classic treatise on decimal 
calculations, Leçons d’Arithmétique (Paris: Colin, 1894), stated that positive 
numbers should be rounded upwards if the first discarded digit is 5 or more; 
since exactly half of the decimal digits are 5 or more, he felt that this rule would 
round upwards exactly half of the time, on the average, so it would produce 
compensating errors. The idea of “round to even” in the ambiguous cases seems 
to have been mentioned first by James B. Scarborough in the first edition of his 
pioneering book Numerical Mathematical Analysis (Baltimore: Johns Hopkins 
Press, 1930), 2; in the second (1950) edition he amplified his earlier remarks, 
stating that “It should be obvious to any thinking person that when a 5 is cut 
off, the preceding digit should be increased by 1 in only half the cases,” and he 
recommended round-to-even in order to achieve this. 

The first analysis of floating point arithmetic was given by F. L. Bauer and K. 
Samelson, Zeitschrift für angewandte Math. und Physik 4 (1953), 312-316. The 
next publication was not until over five years later: J. W. Carr III, CACM 2, 5
(May 1959), 10-15. See also P. C. Fischer, Proc. ACM Nat. Meeting 13 (1958), 
Paper 39. The book Rounding Errors in Algebraic Processes (Englewood Cliffs: 
Prentice-Hall, 1963), by J. H. Wilkinson, shows how to apply error analysis of 
the individual arithmetic operations to the error analysis of large-scale problems; 
see also his treatise on The Algebraic Eigenvalue Problem (Oxford: Clarendon 
Press, 1965). 

Additional early work on floating point accuracy is summarized in two 
important papers that can be especially recommended for further study: W. M. 
Kahan, Proc. IFIP Congress (1971), 2, 1214-1239; R. P. Brent, IEEE Trans. 
C-22 (1973), 601-607. Both papers include useful theory and demonstrate that 
it pays off in practice.

The relations $\prec$, $\sim$, $\succ$, $\approx$ introduced in this section are similar to ideas
published by A. van Wijngaarden in BIT 6 (1966), 66-81. Theorems A and B
above were inspired by some related work of Ole Møller, BIT 5 (1965), 37-50, 
251-255; Theorem C is due to T. J. Dekker, Numer. Math. 18 (1971), 224- 
242. Extensions and refinements of all three theorems have been published by 
S. Linnainmaa, BIT 14 (1974), 167-202. W. M. Kahan introduced Theorem D 
in some unpublished notes; for a complete proof and further commentary, see 
J. F. Reiser and D. E. Knuth, Inf. Proc. Letters 3 (1975), 84-87, 164. 

Unnormalized floating point arithmetic was recommended by F. L. Bauer 
and K. Samelson in the article cited above, and it was independently used by 
J. W. Carr III at the University of Michigan in 1953. Several years later, the 
MANIAC III computer was designed to include both kinds of arithmetic in its 
hardware; see R. L. Ashenhurst and N. Metropolis, JACM 6 (1959), 415-428, 
IEEE Trans. EC-12 (1963), 896-901; R. L. Ashenhurst, Proc. Spring Joint Com- 
puter Conf. 21 (1962), 195-202. See also H. L. Gray and C. Harrison, Jr., Proc. 
Eastern Joint Computer Conf. 16 (1959), 244-248, and W. G. Wadey, JACM 7 
(1960), 129-139, for further early discussions of unnormalized arithmetic. 

For early developments in interval arithmetic, and some modifications, see 
A. Gibb, CACM 4 (1961), 319-320; B. A. Chartres, JACM 13 (1966), 386- 
403; and the book Interval Analysis by Ramon E. Moore (Prentice-Hall, 1966). 
The subsequent flourishing of this subject is described in Moore’s later book, 
Methods and Applications of Interval Analysis (Philadelphia: SIAM, 1979). 

An extension of the Pascal language that allows variables to be of type 
“interval” was developed at the University of Karlsruhe in the early 1980s. For 
a description of this language, which also includes numerous other features for 
scientific computing, see Pascal-SC by Bohlender, Ullrich, Wolff von Gudenberg, 
and Rall (New York: Academic Press, 1987). 

The book Grundlagen des numerischen Rechnens: Mathematische Begründung
der Rechnerarithmetik by Ulrich Kulisch (Mannheim: Bibl. Inst., 1976)
is entirely devoted to the study of floating point arithmetic systems. See also 
Kulisch’s article in IEEE Trans. C-26 (1977), 610-621, and his more recent book 
written jointly with W. L. Miranker, entitled Computer Arithmetic in Theory 
and Practice (New York: Academic Press, 1981). 

An excellent summary of more recent work on floating point error analysis 
appears in the book Accuracy and Stability of Numerical Algorithms by N. J. 
Higham (Philadelphia: SIAM, 1996). 


EXERCISES 
Note: Normalized floating point arithmetic is assumed unless the contrary is specified. 
1. [M18] Prove that identity (7) is a consequence of (2) through (6). 
2. [M20] Use identities (2) through (8) to prove that $(u \oplus x) \oplus (v \oplus y) \ge u \oplus v$
whenever $x \ge 0$ and $y \ge 0$.
3. [M20] Find eight-digit floating decimal numbers $u$, $v$, and $w$ such that
$$u \otimes (v \otimes w) \ne (u \otimes v) \otimes w,$$
and such that no exponent overflow or underflow occurs during the computations.
4. [10] Is it possible to have floating point numbers $u$, $v$, and $w$ for which exponent
overflow occurs during the calculation of $u \otimes (v \otimes w)$ but not during the calculation of
$(u \otimes v) \otimes w$?

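5. [M20] Is $u \oslash v = u \otimes (1 \oslash v)$ an identity, for all floating point numbers $u$ and $v \ne 0$
such that no exponent overflow or underflow occurs?

6. [M22] Are either of the following two identities valid for all floating point
numbers $u$? (a) $0 \ominus (0 \ominus u) = u$; (b) $1 \oslash (1 \oslash u) = u$.

7. [M21] Let $u^{\textcircled{2}}$ stand for $u \otimes u$. Find floating binary numbers $u$ and $v$ such that
$(u \oplus v)^{\textcircled{2}} > 2(u^{\textcircled{2}} + v^{\textcircled{2}})$.

8. [20] Let $\epsilon = 0.0001$; which of the relations

$$u \prec v \quad (\epsilon), \qquad u \sim v \quad (\epsilon), \qquad u \succ v \quad (\epsilon), \qquad u \approx v \quad (\epsilon)$$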


hold for the following pairs of base 10, excess 0, eight-digit floating point numbers? 
a) u = (1,+.31415927), v = (1, +.31416000); 


b) u = (0, +.99997000), v = (1, +.10000039); 

c) u = (24, +.60221400), v = (27, +.00060221); 
d) u = (24, +.60221400), v = (31, +.00000006); 
e) u = (24, +.60221400), v = (28, +.00000000). 


9. [M22] Prove (33), and explain why the conclusion cannot be strengthened to the
relation $u \approx w$ $(\epsilon_1 + \epsilon_2)$.
10. [M25] (W. M. Kahan.) A certain computer performs floating point arithmetic
without proper rounding, and, in fact, its floating point multiplication routine ignores
all but the first $p$ most significant digits of the $2p$-digit product $f_u f_v$. (Thus when
$f_u f_v < 1/b$, the least-significant digit of $u \otimes v$ always comes out to be zero, due to
subsequent normalization.) Show that this causes the monotonicity of multiplication
to fail; in other words, exhibit positive normalized floating point numbers $u$, $v$, and $w$
such that $u < v$ but $u \otimes w > v \otimes w$ on this machine.


11. [M20] Prove Lemma T. 
12. [M24] Carry out the proof of Theorem B and (46) when |e_u − e_v| ≥ p. 


13. [M25] Some programming languages (and even some computers) make use of 
floating point arithmetic only, with no provision for exact calculations with integers. If 
operations on integers are desired, we can, of course, represent an integer as a floating 
point number; and when the floating point operations satisfy the basic definitions in 
(9), we know that all floating point operations will be exact, provided that the operands 
and the answer can each be represented exactly with p significant digits. Therefore — so 
long as we know that the numbers aren’t too large — we can add, subtract, or multiply 
integers with no inaccuracy due to rounding errors. 

But suppose that a programmer wants to determine if m is an exact multiple of n, 
when m and n ≠ 0 are integers. Suppose further that a subroutine is available to 
calculate the quantity round(u mod 1) = u (mod) 1 for any given floating point 
number u, as in exercise 4.2.1-15. One good way to determine whether or not m is a 
multiple of n might be to test whether or not (m ⊘ n) (mod) 1 = 0, using the assumed 
subroutine; but perhaps rounding errors in the floating point calculations will invalidate 
this test in certain cases. 

Find suitable conditions on the range of integer values n ≠ 0 and m, such that m 
is a multiple of n if and only if (m ⊘ n) (mod) 1 = 0. In other words, show that if m 
and n are not too large, this test is valid. 
14. [M27] Find a suitable ε such that (u ⊗ v) ⊗ w ≈ u ⊗ (v ⊗ w) (ε), when unnormalized 
multiplication is being used. (This generalizes (39), since unnormalized multiplication 
is exactly the same as normalized multiplication when the input operands u, v, and w 
are normalized.) 


15. [M24] (H. Björk.) Does the computed midpoint of an interval always lie between 
the endpoints? (In other words, does u ≤ v imply that u ≤ (u ⊕ v) ⊘ 2 ≤ v?) 

16. [M28] (a) What is (···((x_1 ⊕ x_2) ⊕ x_3) ⊕ ··· ⊕ x_n) when n = 10⁶ and x_k = 1.1111111 
for all k, using eight-digit floating decimal arithmetic? (b) What happens when Eq. (14) 
is used to calculate the standard deviation of these particular values x_k? What happens 
when Eqs. (15) and (16) are used instead? (c) Prove that S_k ≥ 0 in (16), for all choices 
of x_1, ..., x_k. 

17. [28] Write a MIX subroutine, FCMP, that compares the floating point number u in 
location ACC with the floating point number v in register A, setting the comparison 
indicator to LESS, EQUAL, or GREATER according as u ≺ v, u ∼ v, or u ≻ v (ε); here ε is 
stored in location EPSILON as a nonnegative fixed point quantity with the radix point 
assumed at the left of the word. Assume normalized inputs. 


18. [M40] In unnormalized arithmetic is there a suitable number ε such that 
u ⊗ (v ⊕ w) ≈ (u ⊗ v) ⊕ (u ⊗ w) (ε)? 


19. [M30] (W. M. Kahan.) Consider the following procedure for floating point 
summation of x_1, x_2, ..., x_n: 


    s_0 = c_0 = 0; 

    y_k = x_k ⊖ c_{k−1},   s_k = s_{k−1} ⊕ y_k,   c_k = (s_k ⊖ s_{k−1}) ⊖ y_k,   for 1 ≤ k ≤ n. 


Let the relative errors in these operations be defined by the equations 


    y_k = (x_k − c_{k−1})(1 + η_k),   s_k = (s_{k−1} + y_k)(1 + σ_k), 
    c_k = ((s_k − s_{k−1})(1 + γ_k) − y_k)(1 + δ_k), 


where |η_k|, |σ_k|, |γ_k|, |δ_k| ≤ ε. Prove that s_n − c_n = Σ_{k=1}^{n} (1 + θ_k) x_k, where |θ_k| ≤ 
2ε + O(nε²). [Theorem C says that if b = 2 and |s_{k−1}| ≥ |y_k| we have s_{k−1} + y_k = s_k − c_k 
exactly. But in this exercise we want to obtain an estimate that is valid even when 
floating point operations are not carefully rounded, assuming only that each operation 
has bounded relative error.] 


20. [25] (S. Linnainmaa.) Find all u and v for which |u| ≥ |v| and (47) fails. 


21. [M35] (T. J. Dekker.) Theorem C shows how to do exact addition of floating 
binary numbers. Explain how to do exact multiplication: Express the product uv in 
the form w + w′, where w and w′ are computed from two given floating binary numbers 
u and v, using only the operations ⊕, ⊖, and ⊗. 
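
The summation procedure of exercise 19 is easy to try out. The following Python sketch is an illustration only: it assumes IEEE double-precision binary arithmetic (Python's float) rather than any machine discussed in the text, and the names naive_sum and kahan_sum are hypothetical.

```python
import math

def naive_sum(xs):
    """Plain left-to-right floating point summation."""
    s = 0.0
    for x in xs:
        s = s + x
    return s

def kahan_sum(xs):
    """Summation with a running correction term c_k, as in exercise 19."""
    s = 0.0                # s_k
    c = 0.0                # c_k
    for x in xs:
        y = x - c          # y_k = x_k (-) c_{k-1}
        t = s + y          # s_k = s_{k-1} (+) y_k
        c = (t - s) - y    # c_k = (s_k (-) s_{k-1}) (-) y_k
        s = t
    return s, c

xs = [0.1] * 1_000_000
reference = math.fsum(xs)              # correctly rounded sum, for comparison
s, c = kahan_sum(xs)
print("error of plain summation:      ", abs(naive_sum(xs) - reference))
print("error of compensated summation:", abs(s - reference))
```

The comparison value comes from math.fsum, which returns the correctly rounded sum of its arguments; it is used here only as a yardstick.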


22. [M30] Can drift occur in floating point multiplication/division? Consider the 
sequence x_0 = u, x_{2n+1} = x_{2n} ⊗ v, x_{2n+2} = x_{2n+1} ⊘ v, given u and v ≠ 0; what is the 
largest subscript k such that x_k ≠ x_{k+2} is possible? 


23. [M26] Prove or disprove: u ⊖ (u (mod) 1) = ⌊u⌋, for all floating point u. 


24. [M27] Consider the set of all intervals [u_l .. u_r], where u_l and u_r are either nonzero 
floating point numbers or the special symbols +0, −0, +∞, −∞; each interval must 
have u_l ≤ u_r, and u_l = u_r is allowed only when u_l is finite and nonzero. The interval 
[u_l .. u_r] stands for all floating point x such that u_l ≤ x ≤ u_r, where we agree that 


    −∞ < −x < −0 < 0 < +0 < +x < +∞ 


for all positive x. (Thus, [1 .. 2] means 1 ≤ x ≤ 2; [+0 .. 1] means 0 < x ≤ 1; 
[−0 .. 1] means 0 ≤ x ≤ 1; [−0 .. +0] denotes the single value 0; and [−∞ .. +∞] stands 
for everything.) Show how to define appropriate arithmetic operations on all such 
intervals, without resorting to overflow or underflow or other anomalous indications 
except when dividing by an interval that includes zero. 


25. [15] When people speak about inaccuracy in floating point arithmetic they often 
ascribe errors to “cancellation” that occurs during the subtraction of nearly equal 
quantities. But when u and v are approximately equal, the difference u ⊖ v is obtained 
exactly, with no error. What do these people really mean? 


26. [M21] Given that u, u′, v, and v′ are positive floating point numbers with u ∼ 
u′ (ε) and v ∼ v′ (ε), prove that there’s a small ε′ such that u ⊕ v ∼ u′ ⊕ v′ (ε′), 
assuming normalized arithmetic. 


27. [M27] (W. M. Kahan.) Prove that 1 ⊘ (1 ⊘ (1 ⊘ u)) = 1 ⊘ u for all u ≠ 0. 


28. [HM30] (H. G. Diamond.) Suppose f(x) is a strictly increasing function on some 
interval [x_0 .. x_1], and let g(x) be the inverse function. (For example, f and g might 
be “exp” and “ln”, or “tan” and “arctan”.) If x is a floating point number such that 
x_0 ≤ x ≤ x_1, let f̂(x) = round(f(x)), and if y is another such that f(x_0) ≤ y ≤ f(x_1), 
let ĝ(y) = round(g(y)); furthermore, let h(x) = ĝ(f̂(x)), whenever this is defined. 
Although h(x) won’t always be equal to x, due to rounding, we expect h(x) to be fairly 
near x. 

Prove that if the precision b^p is at least 3, and if f is strictly concave or strictly 
convex (that is, f″(x) has the same sign for all x in [x_0 .. x_1]), then repeated application 
of h will be stable in the sense that 

    h(h(h(x))) = h(h(x)), 

for all x such that both sides of this equation are defined. In other words, there will 
be no “drift” if the subroutines are properly implemented. 


29. [M25] Give an example to show that the condition b^p ≥ 3 is necessary in the 
previous exercise. 


30. [M30] (W. M. Kahan.) Let f(x) = 1 + x + ··· + x¹⁰⁶ = (1 − x¹⁰⁷)/(1 − x) for 
x < 1, and let g(y) = f((1/3 − y²)(3 + 3.45y²)) for 0 < y < 1. Evaluate g(y) on one or 
more pocket calculators, for y = 10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶, and explain all inaccuracies 
in the results you obtain. (Since most present-day calculators do not round correctly, 
the results are often surprising. Note that g(ε) = 107 − 10491.35ε² + 659749.9625ε⁴ − 
30141386.26625ε⁶ + O(ε⁸).) 


31. [M25] (U. Kulisch.) When the polynomial 2y² + 9x⁴ − y⁴ is evaluated for x = 
408855776 and y = 708158977 using standard 53-bit double-precision floating point 
arithmetic, the result is ≈ −3.7 × 10¹⁹. Evaluating it in the alternative form 2y² + 
(3x² − y²)(3x² + y²) gives ≈ +1.0 × 10¹⁸. The true answer, however, is 1.0 (exactly). 
Explain how to construct similar examples of numerical instability. 
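
The phenomenon of exercise 31 can be observed directly. The sketch below is only an illustration under the assumption that Python's float is an IEEE 53-bit double; the function names are hypothetical. Python integers are exact, so the same two formulas can be evaluated both ways.

```python
def poly_expanded(x, y):
    """2y^2 + 9x^4 - y^4, evaluated in whatever arithmetic x and y carry."""
    return 2 * y**2 + 9 * x**4 - y**4

def poly_factored(x, y):
    """The alternative form 2y^2 + (3x^2 - y^2)(3x^2 + y^2)."""
    return 2 * y**2 + (3 * x**2 - y**2) * (3 * x**2 + y**2)

x, y = 408855776, 708158977

exact = poly_expanded(x, y)                          # exact integer arithmetic
float_expanded = poly_expanded(float(x), float(y))   # rounded at every step
float_factored = poly_factored(float(x), float(y))   # rounded at every step

print("exact integer result:  ", exact)
print("floating, first form:  ", float_expanded)
print("floating, second form: ", float_factored)
```

The precise floating point values printed may depend on the order of evaluation and on the power routine of the underlying library, so none are quoted here.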


32. [M21] For what pairs (a, b) is round_to_even(x) = ⌊ax + b⌋ + ⌈ax − b⌉ for all x? 




*4.2.3. Double-Precision Calculations 
Up to now we have considered “single-precision” floating point arithmetic, which 
essentially means that the floating point values we have dealt with can be stored 
in a single machine word. When single-precision floating point arithmetic does 
not yield sufficient accuracy for a given application, the precision can be increased 
by suitable programming techniques that use two or more words of memory to 
represent each number. 

Although we shall discuss the general question of high-precision calculations 
in Section 4.3, it is appropriate to give a separate discussion of double-precision 
here. Special techniques apply to double precision that are comparatively inap- 
propriate for higher precisions; and double precision is a reasonably important 
topic in its own right, since it is the first step beyond single precision and it is 
applicable to many problems that do not require extremely high precision. 


Well, that paragraph was true when the author wrote the first edition of 
this book in the 1960s. But computers have evolved in such a way that the 
old motivations for double-precision floating point have mostly disappeared; the 
present section is therefore primarily of historical interest. In the planned fourth 
edition of this book, Section 4.2.1 will be renamed “Normalized Calculations,” 
and the present Section 4.2.3 will be replaced by a discussion of “Exceptional 
Numbers.” The new material will focus on special aspects of ANSI/IEEE Stan- 
dard 754: subnormal numbers, infinities, and the so-called NaNs that represent 
undefined or otherwise unusual quantities. (See the references at the end of 
Section 4.2.1.) Meanwhile, let us take one last look at the older ideas, in order 
to see what lessons they can still teach us. 


Double-precision calculations are almost always required for floating point 
rather than fixed point arithmetic, except perhaps in statistical work where fixed 
point double-precision is commonly used to calculate sums of squares and cross 
products; since fixed point versions of double-precision arithmetic are simpler 
than floating point versions, we shall confine our discussion here to the latter. 

Double precision is quite frequently desired not only to extend the precision 
of the fraction parts of floating point numbers, but also to increase the range of 
the exponent part. Thus we shall deal in this section with the following two-word 
format for double-precision floating point numbers in the MIX computer: 


    | ± | e | e | f | f | f |     |   | f | f | f | f | f |          (1) 


Here two bytes are used for the exponent and eight bytes are used for the fraction. 
The exponent is “excess b²/2,” where b is the byte size. The sign will appear in 
the most significant word; it is convenient to ignore the sign of the other word 
completely. 
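
As a rough illustration of format (1), here is a Python sketch that packs and unpacks such a two-word number. It is not part of the MIX programs below; it assumes a byte size of b = 64, takes the value of a number to be b^e times the fraction (the convention of Section 4.2.1), and uses hypothetical names pack, unpack, and value.

```python
from fractions import Fraction

B = 64                  # assumed byte size b
Q = B * B // 2          # excess of the two-byte exponent, b^2/2

def pack(sign, e, frac):
    """Pack a sign (+1 or -1), a true exponent e, and eight fraction bytes
    into two words, each represented as (sign, [five bytes])."""
    assert len(frac) == 8 and all(0 <= f < B for f in frac)
    biased = e + Q
    assert 0 <= biased < B * B, "exponent does not fit in two bytes"
    hi = (sign, [biased // B, biased % B] + list(frac[:3]))
    lo = (+1, list(frac[3:]))        # the sign of the second word is ignored
    return hi, lo

def unpack(hi, lo):
    """Recover (sign, e, frac) from the two words."""
    sign, hi_bytes = hi
    e = hi_bytes[0] * B + hi_bytes[1] - Q
    return sign, e, hi_bytes[2:] + lo[1]

def value(sign, e, frac):
    """Exact rational value sign * b^e * (.f1 f2 ... f8)."""
    f = sum(Fraction(d, B ** (i + 1)) for i, d in enumerate(frac))
    return sign * f * Fraction(B) ** e
```

For instance, pack(-1, 3, [1, 2, 3, 4, 5, 6, 7, 8]) stores the biased exponent 3 + 2048 = 2051 as the two bytes 32 and 3.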

Our discussion of double-precision arithmetic will be quite machine-oriented, 
because it is only by studying the problems involved in coding these routines 
that a person can properly appreciate the subject. A careful study of the MIX 
programs below is therefore essential to the understanding of the material. 
In this section we shall depart from the idealistic goals of accuracy stated 
in the previous two sections; our double-precision routines will not round their 
results, and a little bit of error will sometimes be allowed to creep in. Users 
dare not trust these routines too much. There was ample reason to squeeze out 
every possible drop of accuracy in the single-precision case, but now we face a 
different situation: (a) The extra programming required to ensure true double- 
precision rounding in all cases is considerable; fully accurate routines would take, 
say, twice as much space and half again as much time. It was comparatively 
easy to make our single-precision routines perfect, but double precision brings 
us face to face with our machine’s limitations. A similar situation occurs with 
respect to other floating point subroutines; we can’t expect the cosine routine 
to compute round(cos x) exactly for all x, since that turns out to be virtually 
impossible. Instead, the cosine routine should provide the best relative error it 
can achieve with reasonable speed, for all reasonable values of x. Of course, the 
designer of the routine should try to make the computed function satisfy simple 
mathematical laws whenever possible — for example, 


    COS(−x) = COS x;    |COS x| ≤ 1;    COS x ≥ COS y  for 0 ≤ x ≤ y < π. 


(b) Single-precision arithmetic is a “staple food” that everybody who wants to 
employ floating point arithmetic must use, but double precision is usually for 
situations where such clean results aren’t as important. The difference between 
seven- and eight-place accuracy can be noticeable, but we rarely care about the 
difference between 15- and 16-place accuracy. Double precision is most often 
used for intermediate steps during the calculation of single-precision results; its 
full potential isn’t needed. (c) It will be instructive for us to analyze these 
procedures in order to see how inaccurate they can be, since they typify the 
types of short cuts generally taken in bad single-precision routines (see exercises 
7 and 8). 

Let us now consider addition and subtraction operations from this stand- 
point. Subtraction is, of course, converted to addition by changing the sign of 
the second operand. Addition is performed by separately adding together the 
least-significant halves and the most-significant halves, propagating “carries” 
appropriately. 

A difficulty arises, however, since we are doing signed magnitude arithmetic: 
it is possible to add the least-significant halves and to get the wrong sign (namely, 
when the signs of the operands are opposite and the least-significant half of the 
smaller operand is bigger than the least-significant half of the larger operand). 
The simplest solution is to anticipate the correct sign; so in step A2 of Algorithm 
4.2.1A we will now assume not only that e_u ≥ e_v but also that |u| ≥ |v|. Then 
we can be sure that the final sign will be the sign of u. In other respects, 
double-precision addition is very much like its single-precision counterpart, except that 
everything needs to be done twice. 
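
The idea of anticipating the sign can be sketched in a few lines of Python. This is an illustration only, not a transcription of the MIX program that follows: numbers are held as hypothetical triples (sign, hi, lo) with an assumed half-size W, and exponents and normalization are left out.

```python
W = 10 ** 4     # assumed size of one half; any word size would do

def magnitude(x):
    """Integer magnitude of a (sign, hi, lo) triple."""
    return x[1] * W + x[2]

def add_signed(u, v):
    """Add two signed-magnitude numbers (sign, hi, lo), with 0 <= lo < W."""
    if magnitude(u) < magnitude(v):
        u, v = v, u                     # now |u| >= |v|: the answer has u's sign
    su, uh, ul = u
    sv, vh, vl = v
    if su == sv:
        carry, lo = divmod(ul + vl, W)  # carry out of the low halves
        hi = uh + vh + carry            # may exceed W (fraction overflow)
    else:
        lo = ul - vl
        borrow = 1 if lo < 0 else 0     # low half of v exceeded low half of u
        lo += borrow * W
        hi = uh - vh - borrow           # cannot go negative, since |u| >= |v|
    return (su, hi, lo)

def to_int(x):
    """Signed integer value of a (sign, hi, lo) triple."""
    return x[0] * magnitude(x)
```

With u = (+1, 5, 1) and v = (−1, 4, 9999), the low half of the smaller operand exceeds that of the larger; the borrow handles it, and add_signed returns (+1, 0, 2).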


Program A (Double-precision addition). The subroutine DFADD adds a 
double-precision floating point number v, having the form (1), to a double-precision 
floating point number u, assuming that v is initially in rAX (registers A and X), 
and that u is initially stored in locations ACC and ACCX. The answer appears both 
in rAX and in (ACC, ACCX). The subroutine DFSUB subtracts v from u under the 
same conventions. 

Both input operands are assumed to be normalized, and the answer is 
normalized. The last portion of this program is a double-precision normalization 
procedure that is used by other subroutines of this section. Exercise 5 shows 
how to improve the program significantly. 


01 ABS    EQU  1:5          Field definition for absolute value 
02 SIGN   EQU  0:0          Field definition for sign 
03 EXPD   EQU  1:2          Double-precision exponent field 
04 DFSUB  STA  TEMP         Double-precision subtraction: 
05        LDAN TEMP         Change sign of v. 
06 DFADD  STJ  EXITDF       Double-precision addition: 
07        CMPA ACC(ABS)     Compare |v| with |u|. 
08        JG   1F 
09        JL   2F 
10        CMPX ACCX(ABS) 
11        JLE  2F 
12 1H     STA  ARG          If |v| > |u|, interchange u ↔ v. 
13        STX  ARGX 
14        LDA  ACC 
15        LDX  ACCX 
16        ENT1 ACC          (ACC and ACCX are in consecutive 
17        MOVE ARG(2)         locations.) 
18 2H     STA  TEMP 
19        LD1N TEMP(EXPD)   rI1 ← −e_v. 
20        LD2  ACC(EXPD)    rI2 ← e_u. 
21        INC1 0,2          rI1 ← e_u − e_v. 
22        SLAX 2            Remove exponent. 
23        SRAX 1,1          Scale right. 
24        STA  ARG          0 v1 v2 v3 v4 
25        STX  ARGX         v5 v6 v7 v8 v9 
26        STA  ARGX(SIGN)   Store true sign of v in both halves. 
27        LDA  ACC          (We know that u has the sign of the answer.) 
28        LDX  ACCX         rAX ← u. 
29        SLAX 2            Remove exponent. 
30        STA  ACC          u1 u2 u3 u4 u5 
31        SLAX 4 
32        ENTX 1 
33        STX  EXPO         EXPO ← 1 (see below). 
34        SRC  1            1 u5 u6 u7 u8 
35        STA  1F(SIGN)     A trick, see comments in text. 
36        ADD  ARGX(0:4)    Add 0 v5 v6 v7 v8. 
37        SRAX 4 
38 1H     DECA 1            Recover from inserted 1. (Sign varies) 
39        ADD  ACC(0:4)     Add most significant halves. 
40        ADD  ARG          (Overflow cannot occur) 
41 DNORM  JANZ 1F           Normalization routine: 
42        JXNZ 1F           f_w in rAX, e_w = EXPO + rI2. 
43 DZERO  STA  ACC          If f_w = 0, set e_w ← 0. 
44        JMP  9F 
45 2H     SLAX 1            Normalize to the left. 
46        DEC2 1 
47 1H     CMPA =0=(1:1)     Is the leading byte zero? 
48        JE   2B 
49        SRAX 2            (Rounding omitted) 
50        STA  ACC 
51        LDA  EXPO         Compute final exponent. 
52        INCA 0,2 
53        JAN  EXPUND       Is it negative? 
54        STA  ACC(EXPD) 
55        CMPA =1(3:3)=     Is it more than two bytes? 
56        JL   8F 
57 EXPOVD HLT  20 
58 EXPUND HLT  10 
59 8H     LDA  ACC          Bring answer into rA. 
60 9H     STX  ACCX 
61 EXITDF JMP  *            Exit from subroutine. 
62 ARG    CON  0 
63 ARGX   CON  0 
64 ACC    CON  0            Floating point accumulator 
65 ACCX   CON  0 
66 EXPO   CON  0            Part of “raw exponent” ∎ 


When the least-significant halves are added together in this program, an 
extra digit “1” is inserted at the left of the word that is known to have the 
correct sign. After the addition, this byte can be 0, 1, or 2, depending on 
the circumstances, and all three cases are handled simultaneously in this way. 
(Compare this with the rather cumbersome method of complementation that is 
used in Program 4.2.1A.) 
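
The effect of the inserted digit can be imitated with ordinary integers. In the following Python sketch (an illustration with hypothetical names, not a rendering of the MIX code), ul is the low half of u, vl is the low half of v carrying v's sign relative to u, and W is the assumed size of a low half.

```python
def add_low_halves(ul, vl, W):
    """Add the low halves with an extra leading 1 inserted.

    ul lies in [0, W) and vl in (-W, W).  Returns (carry, low), where
    carry is -1, 0, or +1 and must be added to the high halves."""
    t = W + ul + vl              # the inserted 1 keeps t positive
    lead, low = divmod(t, W)     # lead is 0, 1, or 2
    return lead - 1, low         # remove the inserted 1 again

if __name__ == "__main__":
    W = 100
    for ul, vl in [(30, 50), (30, -50), (80, 50)]:
        print(ul, vl, add_low_halves(ul, vl, W))
```

The three sample pairs exercise the three possible leading values: 1 (no carry), 0 (a borrow), and 2 (a carry).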

It is worth noting that register A can be zero after the instruction on line 40 
has been performed; and, because of the way MIX defines the sign of a zero result, 
the accumulator contains the correct sign that is to be attached to the result if 
register X is nonzero. If lines 39 and 40 were interchanged, the program would 
be incorrect, even though both instructions are ‘ADD’! 


Now let us consider double-precision multiplication. The product has four 
components, shown schematically in Fig. 4. Since we need only the leftmost 
eight bytes, it is convenient to ignore the digits to the right of the vertical line 
in the diagram; in particular, we need not even compute the product of the two 
least-significant halves. 
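
The four components of the product can be written out exactly with rational arithmetic. The Python sketch below is illustrative only; the operand values are made up, ε is taken to be 1/1000 for readability, and the function name is hypothetical. It shows that the product of the two least-significant halves contributes less than ε².

```python
from fractions import Fraction

def product_components(um, ul, vm, vl, eps):
    """The four parts of (um + eps*ul) * (vm + eps*vl), most significant first."""
    return (um * vm,
            eps * um * vl,
            eps * ul * vm,
            eps * eps * ul * vl)

eps = Fraction(1, 1000)                                # illustrative word size
um, ul = Fraction(314, 1000), Fraction(159, 1000)      # made-up operand halves
vm, vl = Fraction(271, 1000), Fraction(828, 1000)

parts = product_components(um, ul, vm, vl, eps)
full = (um + eps * ul) * (vm + eps * vl)
kept = sum(parts[:3])                                  # omit ul * vl entirely

print("omitted term:", float(parts[3]))
print("full - kept: ", float(full - kept))
```

Program M discards somewhat more than this one term, since it also drops the low-order bytes of the other partial products.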


               uuuuu  uuu00     = u_m + εu_l 
               vvvvv  vvv00     = v_m + εv_l 

           |   xxxxx  x0000     = ε² u_l × v_l 
       xxxx|x  xxx00            = ε u_m × v_l 
       xxxx|x  xxx00            = ε u_l × v_m 
xxxxx  xxxx|x                   = u_m × v_m 

wwwww  wwww|w  wwwww  w0000 


Fig. 4. Double-precision multiplication of eight-byte fraction parts. 


Program M (Double-precision multiplication). The input and output 
conventions for this subroutine are the same as for Program A. 


01 BYTE   EQU  1(4:4)        Byte size 
02 QQ     EQU  BYTE*BYTE/2   Excess of double-precision exponent 
03 DFMUL  STJ  EXITDF        Double-precision multiplication: 
04        STA  TEMP 
05        SLAX 2             Remove exponent. 
06        STA  ARG           v_m 
07        STX  ARGX          v_l 
08        LDA  TEMP(EXPD) 
09        ADD  ACC(EXPD) 
10        STA  EXPO          EXPO ← e_u + e_v. 
11        ENT2 -QQ           rI2 ← −QQ. 
12        LDA  ACC 
13        LDX  ACCX 
14        SLAX 2             Remove exponent. 
15        STA  ACC           u_m 
16        STX  ACCX          u_l 
17        MUL  ARGX          u_m × v_l 
18        STA  TEMP 
19        LDA  ARG(ABS) 
20        MUL  ACCX(ABS)     |v_m × u_l| 
21        SRA  1             0 x x x x 
22        ADD  TEMP(1:4)     (Overflow cannot occur) 
23        STA  TEMP 
24        LDA  ARG 
25        MUL  ACC           v_m × u_m 
26        STA  TEMP(SIGN)    Store true sign of result. 
27        STA  ACC           Now prepare to add all the 
28        STX  ACCX            partial products together. 
29        LDA  ACCX(0:4)     0 x x x x 
30        ADD  TEMP          (Overflow cannot occur) 
31        SRAX 4 
32        ADD  ACC           (Overflow cannot occur) 
33        JMP  DNORM         Normalize and exit. ∎ 


Notice the careful treatment of signs in this program, and note also the fact 
that the range of exponents makes it impossible to compute the final exponent 
using an index register. Program M is perhaps too slipshod in accuracy, since it 
uses only the information to the left of the vertical line in Fig. 4; this can make 
the least significant byte as much as 2 in error. A little more accuracy can be 
achieved as discussed in exercise 4. 
Double-precision floating division is the most difficult routine, or at least the 
most frightening prospect we have encountered so far in this chapter. Actually, 
it is not terribly complicated, once we see how to do it; let us write the numbers 
to be divided in the form (u_m + εu_l)/(v_m + εv_l), where ε is the reciprocal of 
the word size of the computer, and where v_m is assumed to be normalized. The 
fraction can now be expanded as follows: 


    (u_m + εu_l)/(v_m + εv_l) = ((u_m + εu_l)/v_m) · 1/(1 + ε(v_l/v_m)) 

                              = ((u_m + εu_l)/v_m) · (1 − ε(v_l/v_m) + ε²(v_l/v_m)² − ···).      (2) 


Since 0 ≤ |v_l| < 1 and 1/b ≤ |v_m| < 1, we have |v_l/v_m| < b, and the error 
from dropping terms involving ε² can be disregarded. Our method therefore is 
to compute w_m + εw_l = (u_m + εu_l)/v_m, and then to subtract ε times w_m v_l/v_m 
from the result. 
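
The size of the neglected terms can be checked with exact rational arithmetic. The following Python sketch is an illustration under stated assumptions: ε = 1/1000, the decimal operands are those of exercise 1, w_m is taken to be the quotient truncated to one word, and the function name divide_double is hypothetical.

```python
from fractions import Fraction
from math import floor

def divide_double(um, ul, vm, vl, eps):
    """Approximate (um + eps*ul)/(vm + eps*vl) by the method described above."""
    w = (um + eps * ul) / vm          # w_m + eps*w_l
    wm = floor(w / eps) * eps         # leading word w_m of that quotient
    return w - eps * wm * vl / vm     # terms of order eps^2 are dropped

eps = Fraction(1, 1000)
um, ul = Fraction(180, 1000), Fraction(0)
vm, vl = Fraction(314, 1000), Fraction(159, 1000)

approx = divide_double(um, ul, vm, vl, eps)
exact = (um + eps * ul) / (vm + eps * vl)

print("approximation:", float(approx))
print("true quotient:", float(exact))
print("difference:   ", float(abs(approx - exact)))
```

The difference stays below ε², in agreement with the claim that the dropped terms can be disregarded at this precision.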

In the following program, lines 27-32 do the lower half of a double-precision 
addition, using another method for forcing the appropriate sign as an alternative 
to the trick of Program A. 




Program D (Double-precision division). This program adheres to the same 
conventions as Programs A and M. 


01 DFDIV  STJ  EXITDF       Double-precision division: 
02        JOV  OFLO         Ensure that overflow is off. 
03        STA  TEMP 
04        SLAX 2            Remove exponent. 
05        STA  ARG          v_m 
06        STX  ARGX         v_l 
07        LDA  ACC(EXPD) 
08        SUB  TEMP(EXPD) 
09        STA  EXPO         EXPO ← e_u − e_v. 
10        ENT2 QQ+1         rI2 ← QQ + 1. 
11        LDA  ACC 
12        LDX  ACCX 
13        SLAX 2            Remove exponent. 
14        SRAX 1            (See Algorithm 4.2.1M) 
15        DIV  ARG          If overflow, it is detected below. 
16        STA  ACC          w_m 
17        SLAX 5            Use remainder in further division. 
18        DIV  ARG 
19        STA  ACCX         ±w_l 
20        LDA  ARGX(1:4) 
21        ENTX 0 
22        DIV  ARG(ABS)     rA ← ⌊b⁴|v_l/v_m|⌋/b⁵. 
23        JOV  DVZROD       Did division cause overflow? 
24        MUL  ACC(ABS)     rAX ← |w_m v_l/(b v_m)|, approximately. 
25        SRAX 4            Multiply by b, and save 
26        SLC  5              the leading byte in rX. 
27        SUB  ACCX(ABS)    Subtract |w_l|. 
28        DECA 1            Force minus sign. 
29        SUB  WM1 
30        JOV  *+2          If no overflow, carry one more 
31        INCX 1              to upper half. 
32        SLC  5            (Now rA ≤ 0) 
33        ADD  ACC(ABS)     rA ← |w_m| − |rA|. 
34        STA  ACC(ABS)     (Now rA ≥ 0) 
35        LDA  ACC          rA ← w_m with correct sign. 
36        JMP  DNORM        Normalize and exit. 
37 DVZROD HLT  30           Unnormalized or zero divisor 
38 1H     EQU  1(1:1) 
39 WM1    CON  1B-1,BYTE-1(1:1)   Word size minus one ∎ 


Here is a table of the approximate average computation times for these 
double-precision subroutines, compared to the single-precision subroutines that 
appear in Section 4.2.1: 


                    Single precision    Double precision 
Addition                 45.5u               84u 
Subtraction              49.5u               88u 
Multiplication           48u                 109u 
Division                 52u                 126.5u 


For extension of the methods of this section to triple-precision floating point 
fraction parts, see Y. Ikebe, CACM 8 (1965), 175-177. 


EXERCISES 


1. [16] Try the double-precision division technique by hand, with ε = 1/1000, when 
dividing 180000 by 314159. (Thus, let (u_m, u_l) = (.180, .000) and (v_m, v_l) = (.314, .159), 
and find the quotient using the method suggested in the text following (2).) 

2. [20] Would it be a good idea to insert the instruction ‘ENTX 0’ between lines 30 
and 31 of Program M, in order to keep unwanted information left over in register X 
from interfering with the accuracy of the results? 


3. [M20] Explain why overflow cannot occur during Program M. 


4. [22] How should Program M be changed so that extra accuracy is achieved, 
essentially by moving the vertical line in Fig. 4 over to the right one position? Specify 
all changes that are required, and determine the difference in execution time caused by 
these changes. 


5. [24] How should Program A be changed so that extra accuracy is achieved, essen- 
tially by working with a nine-byte accumulator instead of an eight-byte accumulator 
to the right of the radix point? Specify all changes that are required, and determine 
the difference in execution time caused by these changes. 


6. [23] Assume that the double-precision subroutines of this section and the single- 
precision subroutines of Section 4.2.1 are being used in the same main program. Write a 
subroutine that converts a single-precision floating point number into double-precision 
form (1), and write another subroutine that converts a double-precision floating point 
number into single-precision form (reporting exponent overflow or underflow if the 
conversion is impossible). 


7. [M30] Estimate the accuracy of the double-precision subroutines in this section, 
by finding bounds δ_1, δ_2, and δ_3 on the relative errors 


    |((u ⊕ v) − (u + v))/(u + v)|, 
    |((u ⊗ v) − (u × v))/(u × v)|, 
    |((u ⊘ v) − (u/v))/(u/v)|. 


8. [M28] Estimate the accuracy of the “improved” double-precision subroutines of 
exercises 4 and 5, in the sense of exercise 7. 


9. [M42] T. J. Dekker [Numer. Math. 18 (1971), 224-242] has suggested an 
alternative approach to double precision, based entirely on single-precision floating binary 
calculations. For example, Theorem 4.2.2C states that u + v = w + r, where w = u ⊕ v 
and r = (u ⊖ w) ⊕ v, if |u| ≥ |v| and the radix is 2; here |r| ≤ |w|/2^p, so the pair 
(w, r) may be considered a double-precision version of u + v. To add two such pairs 
(u, u′) ⊕ (v, v′), where |u′| ≤ |u|/2^p and |v′| ≤ |v|/2^p and |u| ≥ |v|, Dekker suggests 
computing u + v = w + r (exactly), then s = (r ⊕ v′) ⊕ u′ (an approximate remainder), 
and finally returning the value (w ⊕ s, (w ⊖ (w ⊕ s)) ⊕ s). 

Study the accuracy and efficiency of this approach when it is used recursively to 
produce quadruple-precision calculations. 
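
The formulas quoted in exercise 9 can be tried directly in binary floating point. The sketch below is only an illustration: it assumes that Python's float is an IEEE double with round-to-nearest arithmetic, the function names are hypothetical, and the exactness check uses rational arithmetic.

```python
from fractions import Fraction

def fast_two_sum(u, v):
    """For |u| >= |v|, return (w, r) with w = u (+) v and u + v = w + r."""
    w = u + v
    r = (u - w) + v
    return w, r

def add_pairs(u, u1, v, v1):
    """Dekker's suggested sum of the pairs (u, u') and (v, v'), for |u| >= |v|."""
    w, r = fast_two_sum(u, v)
    s = (r + v1) + u1            # an approximate remainder
    hi = w + s
    lo = (w - hi) + s
    return hi, lo

w, r = fast_two_sum(0.3, 0.1)
print(w, r)
# The pair (w, r) represents the sum of the two operands without error:
print(Fraction(w) + Fraction(r) == Fraction(0.3) + Fraction(0.1))
```

Here Fraction(x) converts a float to the exact rational number it stores, so the final comparison involves no rounding at all.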



4.2.4. Distribution of Floating Point Numbers 


In order to analyze the average behavior of floating point arithmetic algorithms 
(and in particular to determine their average running time), we need some 
statistical information that allows us to determine how often various cases arise. 
The purpose of this section is to discuss the empirical and theoretical properties 
of the distribution of floating point numbers. 


A. Addition and subtraction routines. The execution time for a floating 
point addition or subtraction depends largely on the initial difference of expo- 
nents, and also on the number of normalization steps required (to the left or to 
the right). No way is known to give a good theoretical model that tells what 
characteristics to expect, but extensive empirical investigations have been made 
by D. W. Sweeney [IBM Systems J. 4 (1965), 31-42]. 

By means of a special tracing routine, Sweeney ran six “typical” large-scale 
numerical programs, selected from several different computing laboratories, and 
examined each floating addition or subtraction operation very carefully. Over 
250,000 floating point addition-subtractions were involved in gathering this data. 
About one out of every ten instructions executed by the tested programs was 
either FADD or FSUB. 

Subtraction is the same as addition preceded by negating the second operand, 
so we can give all the statistics as if we were merely doing addition. Sweeney’s 
results can be summarized as follows: 

One of the two operands to be added was found to be equal to zero about 
9 percent of the time, and this was usually the accumulator (ACC). The other 
91 percent of the cases split about equally between operands of the same or of 
opposite signs, and about equally between cases where |u| ≤ |v| or |v| ≤ |u|. The 
computed answer was zero about 1.4 percent of the time. 


Table 1 
EMPIRICAL DATA FOR OPERAND ALIGNMENTS BEFORE ADDITION 

|e_u − e_v|    b = 2    b = 10    b = 16    b = 64 
0              0.33     0.47      0.47      0.56 
1              0.12     0.23      0.26      0.27 
2              0.09     0.11      0.10      0.04 
3              0.07     0.03      0.02      0.02 
4              0.07     0.01      0.01      0.02 
5              0.04     0.01      0.02      0.00 
over 5         0.28     0.13      0.11      0.09 
average        3.1      0.9       0.8       0.5 


Table 2 
EMPIRICAL DATA FOR NORMALIZATION AFTER ADDITION 

                 b = 2    b = 10    b = 16    b = 64 
Shift right 1    0.20     0.07      0.06      0.03 
No shift         0.59     0.80      0.82      0.87 
Shift left 1     0.07     0.08      0.07      0.06 
Shift left 2     0.03     0.02      0.01      0.01 
Shift left 3     0.02     0.00      0.01      0.00 
Shift left 4     0.02     0.01      0.00      0.01 
Shift left > 4   0.06     0.02      0.02      0.02 

The difference between exponents had a behavior approximately given by 
the probabilities shown in Table 1, for various radices b. (The “over 5” line of 
that table includes essentially all of the cases when one operand was zero, but 
the “average” line does not include these cases.) 

When u and v have the same sign and are normalized, then u + v either 
requires one shift to the right (for fraction overflow), or no normalization shifts 
whatever. When u and v have opposite signs, we have zero or more left shifts 
during the normalization. Table 2 gives the observed number of shifts required; 
the last line of that table includes all cases where the result was zero. The 
average number of left shifts per normalization was about 0.9 when b = 2; about 
0.2 when b = 10 or 16; and about 0.1 when b = 64. 


B. The fraction parts. Further analysis of floating point routines can be based 
on the statistical distribution of the fraction parts of randomly chosen normalized 
floating point numbers. The facts are quite surprising, and there is an interesting 
theory that accounts for the unusual phenomena that are observed. 

For convenience let us assume temporarily that we are dealing with floating 
decimal arithmetic (radix 10); modifications of the following discussion to any 
other positive integer base b will be very straightforward. Suppose we are given 
a “random” positive normalized number (e, f) = 10^e · f. Since f is normalized, 
we know that its leading digit is 1, 2, 3, 4, 5, 6, 7, 8, or 9, and we might naturally 
expect each of these nine possible leading digits to occur about one-ninth of the 
time. But, in fact, the behavior in practice is quite different. For example, the 
leading digit tends to be equal to 1 more than 30 percent of the time! 

One way to test the assertion just made is to take a table of physical con- 
stants (like the speed of light or the acceleration of gravity) from some standard 
reference. If we look at the Handbook of Mathematical Functions (U.S. Dept of 
Commerce, 1964), for example, we find that 8 of the 28 different physical con- 
stants given in Table 2.3, roughly 29 percent, have leading digit equal to 1. The 
decimal values of n! for 1 ≤ n ≤ 100 include exactly 30 entries beginning with 1; 
so do the decimal values of 2^n and of F_n, for 1 ≤ n ≤ 100. We might also try look- 
ing at census reports, or a Farmer’s Almanack (but not a telephone directory). 
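
Counts like these are easy to reproduce with exact integer arithmetic. The following Python sketch is one way to do it; the helper names are hypothetical, and F_n is taken to be the Fibonacci sequence with F_1 = F_2 = 1.

```python
from math import factorial

def leading_digit(n):
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])

def fibonacci(n):
    """F_n, with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def count_leading_ones(values):
    """How many of the given integers begin with the digit 1."""
    return sum(1 for v in values if leading_digit(v) == 1)

ns = range(1, 101)
ones_in_powers = count_leading_ones(2 ** n for n in ns)
ones_in_fibonacci = count_leading_ones(fibonacci(n) for n in ns)
ones_in_factorials = count_leading_ones(factorial(n) for n in ns)

print("2^n:", ones_in_powers)
print("F_n:", ones_in_fibonacci)
print("n! :", ones_in_factorials)
```

Because Python integers have unlimited precision, no rounding enters into these counts.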

In the days before pocket calculators, the pages in well-used tables of loga- 
rithms tended to get quite dirty in the front, while the last pages stayed relatively 
clean and neat. This phenomenon was apparently first mentioned in print by 
the astronomer Simon Newcomb [Amer. J. Math. 4 (1881), 39-40], who gave 
good grounds for believing that the leading digit d occurs with probability 
log₁₀(1 + 1/d). The same distribution was discovered empirically, many years 
later, by Frank Benford, who reported the results of 20,229 observations taken 
from many different sources [Proc. Amer. Philosophical Soc. 78 (1938), 551-572]. 

In order to account for this leading-digit law, let’s take a closer look at 
the way we write numbers in floating point notation. If we take any positive 
number u, its fraction part is determined by the formula 10 f_u = 10^((log₁₀ u) mod 1); 
hence its leading digit is less than d if and only if 


    (log₁₀ u) mod 1 < log₁₀ d.      (1) 


Now if we have a “random” positive number U, chosen from some reasonable 
distribution that might occur in nature, we might expect that (log₁₀ U) mod 1 
would be uniformly distributed between zero and one, at least to a very good 
approximation. (Similarly, we expect U mod 1, U² mod 1, √(U + π) mod 1, etc., 
to be uniformly distributed. We expect a roulette wheel to be unbiased, for 
essentially the same reason.) Therefore by (1) the leading digit will be 1 with 
probability log₁₀ 2 ≈ 30.103 percent; it will be 2 with probability log₁₀ 3 − log₁₀ 2 ≈ 17.609 
percent; and, in general, if r is any real value between 1 and 10, we ought to 
have 10 f_U ≤ r approximately log₁₀ r of the time. 
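
The probabilities implied by this logarithmic law are simple to tabulate. The Python sketch below is merely illustrative; the function names are hypothetical, and ten_f relies on ordinary floating point logarithms, so its result is only approximate.

```python
import math

def ten_f(u):
    """10 * f_u for a positive number u: its fraction part scaled into [1, 10)."""
    return 10 ** (math.log10(u) % 1)

def digit_probability(d):
    """Probability that the leading digit is d, namely log10(d + 1) - log10(d)."""
    return math.log10(d + 1) - math.log10(d)

for d in range(1, 10):
    print(d, f"{100 * digit_probability(d):.3f} percent")

print(ten_f(299792.458))     # the leading digit is the integer part of this value
```

The nine probabilities sum to log₁₀ 10 = 1, as they must.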

The fact that leading digits tend to be small makes the most obvious tech- 
niques of “average error” estimation for floating point calculations invalid. The 
relative error due to rounding is usually a little more than expected. 

Of course, it may justly be said that the heuristic argument above does 
not prove the stated law. It merely shows us a plausible reason why the leading 
digits behave the way they do. An interesting approach to the analysis of leading 
digits has been suggested by R. Hamming: Let p(r) be the probability that 
$10 f_U \le r$, where $1 \le r \le 10$ and $f_U$ is the normalized fraction part of a random
normalized floating point number U. If we think of random quantities in the real 
world, we observe that they are measured in terms of arbitrary units; and if we 
were to change the definition of a meter or a gram, many of the fundamental
physical constants would have different values. Suppose then that all of the
numbers in the universe are suddenly multiplied by a constant factor c; our 
universe of random floating point quantities should be essentially unchanged by 
this transformation, so p(r) should not be affected. 

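Multiplying everything by $c$ has the effect of transforming $(\log_{10} U) \bmod 1$
into $(\log_{10} U + \log_{10} c) \bmod 1$. It is now time to set up formulas that describe
the desired behavior; we may assume that $1 \le c \le 10$. By definition,

$$p(r) = \Pr\bigl((\log_{10} U) \bmod 1 \le \log_{10} r\bigr).$$

By our assumption, we should also have

$$\begin{aligned}
p(r) &= \Pr\bigl((\log_{10} U + \log_{10} c) \bmod 1 \le \log_{10} r\bigr) \\
&= \begin{cases}
\Pr\bigl((\log_{10} U \bmod 1) \le \log_{10} r - \log_{10} c \ \text{ or } \ (\log_{10} U \bmod 1) \ge 1 - \log_{10} c\bigr), & \text{if } c \le r; \\
\Pr\bigl((\log_{10} U \bmod 1) \le \log_{10} r + 1 - \log_{10} c \ \text{ and } \ (\log_{10} U \bmod 1) \ge 1 - \log_{10} c\bigr), & \text{if } c \ge r;
\end{cases} \\
&= \begin{cases}
p(r/c) + 1 - p(10/c), & \text{if } c \le r; \\
p(10r/c) - p(10/c), & \text{if } c \ge r.
\end{cases}
\end{aligned} \qquad (2)$$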


Let us now extend the function $p(r)$ to values outside the range $1 \le r \le 10$, by
defining $p(10^n r) = p(r) + n$; then if we replace $10/c$ by $d$, the last equation of (2)
may be written

$$p(rd) = p(r) + p(d). \qquad (3)$$

If our assumption about invariance of the distribution under multiplication by
a constant factor is valid, then Eq. (3) must hold for all $r > 0$ and $1 \le d \le 10$.
The facts that $p(1) = 0$ and $p(10) = 1$ now imply that

$$1 = p(10) = p\bigl((\sqrt[n]{10}\,)^n\bigr) = p(\sqrt[n]{10}\,) + p\bigl((\sqrt[n]{10}\,)^{n-1}\bigr) = \cdots = n\,p(\sqrt[n]{10}\,);$$

hence we deduce that $p(10^{m/n}) = m/n$ for all positive integers $m$ and $n$. If
we now decide to require that $p$ is continuous, we are forced to conclude that
$p(r) = \log_{10} r$, and this is the desired law.

Although this argument may be more convincing than the first one, it doesn’t 
really hold up under scrutiny if we stick to conventional notions of probability. 
The traditional way to make the argument above rigorous is to assume that
there is some underlying distribution of numbers $F(u)$ such that a given positive
number $U$ is $\le u$ with probability $F(u)$; then the probability of concern to us is

$$p(r) = \sum_m \bigl(F(10^m r) - F(10^m)\bigr), \qquad (4)$$

summed over all values $-\infty < m < \infty$. Our assumptions about scale invariance
and continuity have led us to conclude that

$$p(r) = \log_{10} r.$$

Using the same argument, we could “prove” that

$$\sum_m \bigl(F(b^m r) - F(b^m)\bigr) = \log_b r, \qquad (5)$$

for each integer $b \ge 2$, when $1 \le r \le b$. But there is no distribution function $F$
that satisfies this equation for all such $b$ and $r$! (See exercise 7.)

One way out of the difficulty is to regard the logarithm law $p(r) = \log_{10} r$ as
only a very close approximation to the true distribution. The true distribution 
itself may perhaps be changing as the universe expands, becoming a better and 
better approximation as time goes on; and if we replace 10 by an arbitrary 
base b, the approximation might be less accurate (at any given time) as b gets 
larger. Another rather appealing way to resolve the dilemma, by abandoning the 
traditional idea of a distribution function, has been suggested by R. A. Raimi, 
AMM 76 (1969), 342-348. 

The hedging in the last paragraph is probably a very unsatisfactory ex- 
planation, and so the following further calculation (which sticks to rigorous 
mathematics and avoids any intuitive, yet paradoxical, notions of probability) 
should be welcome. Let us consider the distribution of the leading digits of 
the positive integers, instead of the distribution for some imagined set of real 
numbers. The investigation of this topic is quite interesting, not only because 
it sheds some light on the probability distributions of floating point data, but 
also because it makes a particularly instructive example of how to combine the 
methods of discrete mathematics with the methods of infinitesimal calculus. 

In the following discussion, let $r$ be a fixed real number, $1 \le r \le 10$; we
will attempt to make a reasonable definition of $p(r)$, the “probability” that the
representation $10^{e_N} \cdot f_N$ of a “random” positive integer $N$ has $10 f_N < r$, assuming
infinite precision.

To start, let us try to find the probability using a limiting method like the 
definition of “Pr” in Section 3.5. One nice way to rephrase that definition is to 
define 


$$P_0(n) = [\,n = 10^e \cdot f \text{ where } 10f < r\,] = [\,(\log_{10} n) \bmod 1 < \log_{10} r\,]. \qquad (6)$$

Now $P_0(1)$, $P_0(2)$, ... is an infinite sequence of zeros and ones, with ones to
represent the cases that contribute to the probability we are seeking. We can
try to “average out” this sequence, by defining

$$P_1(n) = \frac{1}{n} \sum_{k=1}^{n} P_0(k). \qquad (7)$$

Thus if we generate a random integer between 1 and $n$ using the techniques of
Chapter 3, and convert it to floating decimal form $(e, f)$, the probability that
$10f < r$ is exactly $P_1(n)$. It is natural to let $\lim_{n\to\infty} P_1(n)$ be the “probability”
$p(r)$ we are after, and that is just what we did in Definition 3.5A.
But in this case the limit does not exist. For example, let us consider the
subsequence

$$P_1(s),\ P_1(10s),\ P_1(100s),\ \ldots,\ P_1(10^n s),\ \ldots,$$

where $s$ is a real number, $1 \le s \le 10$. If $s \le r$, we find that

$$\begin{aligned}
P_1(10^n s) &= \frac{1}{\lfloor 10^n s\rfloor}\bigl(\lceil r\rceil - 1 + \lceil 10r\rceil - 10 + \cdots + \lceil 10^{n-1} r\rceil - 10^{n-1} + \lfloor 10^n s\rfloor + 1 - 10^n\bigr) \\
&= \frac{1}{\lfloor 10^n s\rfloor}\bigl(\tfrac{1}{9}(10^n r - 10^{n+1}) + \lfloor 10^n s\rfloor + O(n)\bigr).
\end{aligned} \qquad (8)$$

As $n \to \infty$, $P_1(10^n s)$ therefore approaches the limiting value $1 + (r - 10)/9s$. The
same calculation is valid for the case $s > r$ if we replace $\lfloor 10^n s\rfloor + 1$ by $\lceil 10^n r\rceil$;
thus we obtain the limiting value $10(r - 1)/9s$ when $s \ge r$. [See J. Franel,
Naturforschende Gesellschaft, Vierteljahrsschrift 62 (Zürich: 1917), 286-295.]
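
The two limiting values can be compared with brute-force counts. The sketch below is an illustration only: it assumes that $r$ is an integer (so that $10f < r$ simply means "leading digit less than $r$"), takes $r = 2$ and $n = 5$, and uses function names invented for this example.

```python
def p1(n: int, r: int) -> float:
    """P_1(n): fraction of the integers 1..n whose leading decimal digit is less than r."""
    hits = sum(1 for k in range(1, n + 1) if int(str(k)[0]) < r)
    return hits / n

def limiting_value(s: float, r: float) -> float:
    """Limit of P_1(10^n s) as n grows, for 1 <= s <= 10 and 1 <= r <= 10."""
    if s <= r:
        return 1 + (r - 10) / (9 * s)
    return 10 * (r - 1) / (9 * s)

r = 2
observed = {s: p1(10**5 * s, r) for s in (1, 2, 5)}
for s, value in observed.items():
    print(s, round(value, 5), round(limiting_value(s, r), 5))
```

For $r = 2$ the limits are $1/9$ at $s = 1$, $5/9$ at $s = 2$, and $2/9$ at $s = 5$, so the observed frequency of leading 1s swings over a wide range and never settles down.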

In other words, the sequence $\langle P_1(n)\rangle$ has subsequences $\langle P_1(10^n s)\rangle$ whose
limit goes from $(r - 1)/9$ up to $10(r - 1)/9r$ and down again to $(r - 1)/9$, as
$s$ goes from 1 to $r$ to 10. We see that $P_1(n)$ has no limit as $n \to \infty$; and the values
of $P_1(n)$ for large $n$ are not particularly good approximations to our conjectured
limit $\log_{10} r$ either!

Since $P_1(n)$ doesn’t approach a limit, we can try to use the same idea as (7)
once again, to “average out” the anomalous behavior. In general, let

$$P_{m+1}(n) = \frac{1}{n} \sum_{k=1}^{n} P_m(k). \qquad (9)$$

Then $P_{m+1}(n)$ will tend to be a more well-behaved sequence than $P_m(n)$. Let us
try to confirm this with quantitative calculations; our experience with the special
case $m = 0$ indicates that it might be worthwhile to consider the subsequence
$P_{m+1}(10^n s)$. The following results can, in fact, be derived:
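
Before turning to the formal statement, the effect of repeated averaging can be observed experimentally. The following sketch is an illustration under stated assumptions: $r = 2$, all sequences are truncated at $n = 10^5$, and the helper names are hypothetical. It computes $P_1$, $P_2$, $P_3$ by running sums and measures how far each one swings over the decade $10^4 \le n \le 10^5$.

```python
from itertools import accumulate
import math

def averaged_sequences(limit: int, r: int, levels: int):
    """Return [P_0, P_1, ..., P_levels]; each list is indexed by n = 1..limit (index 0 unused)."""
    current = [0.0] + [1.0 if int(str(k)[0]) < r else 0.0 for k in range(1, limit + 1)]
    sequences = [current]
    for _ in range(levels):
        sums = list(accumulate(current))          # sums[n] = current[1] + ... + current[n]
        current = [0.0] + [sums[n] / n for n in range(1, limit + 1)]
        sequences.append(current)
    return sequences

def spread(seq, lo: int, hi: int) -> float:
    """Difference between the largest and smallest value of seq[n] for lo <= n <= hi."""
    window = seq[lo:hi + 1]
    return max(window) - min(window)

P = averaged_sequences(10**5, r=2, levels=3)
spreads = [spread(P[m], 10**4, 10**5) for m in (1, 2, 3)]
print([round(x, 4) for x in spreads])
```

Each level of averaging narrows the oscillation, yet none of the three sequences becomes constant over the decade.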


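Lemma Q. For any integer $m \ge 1$ and any real number $\epsilon > 0$, there are
functions $Q_m(s)$, $R_m(s)$ and an integer $N_m(\epsilon)$, such that whenever $n > N_m(\epsilon)$
and $1 \le s \le 10$, we have

$$\bigl|P_m(10^n s) - Q_m(s) - R_m(s)\,[s > r]\bigr| < \epsilon. \qquad (10)$$

Furthermore the functions $Q_m(s)$ and $R_m(s)$ satisfy the relations

$$\begin{aligned}
Q_m(s) &= \frac{1}{s}\Bigl(\frac{1}{9}\int_1^{10} Q_{m-1}(t)\,dt + \int_1^s Q_{m-1}(t)\,dt + \frac{1}{9}\int_r^{10} R_{m-1}(t)\,dt\Bigr); \\
R_m(s) &= \frac{1}{s}\int_r^s R_{m-1}(t)\,dt; \\
Q_0(s) &= 1, \qquad R_0(s) = -1.
\end{aligned} \qquad (11)$$
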
Proof. Consider the functions $Q_m(s)$ and $R_m(s)$ defined by (11), and let

$$S_m(t) = Q_m(t) + R_m(t)\,[t > r]. \qquad (12)$$

We will prove the lemma by induction on $m$.
First note that $Q_1(s) = (1 + (s - 1) - (10 - r)/9)/s = 1 + (r - 10)/9s$, and
$R_1(s) = (r - s)/s$. From (8) we find that $|P_1(10^n s) - S_1(s)| = O(n)/10^n$; this
establishes the lemma when $m = 1$.

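Now for $m > 1$, we have

$$P_m(10^n s) = \frac{1}{s}\Biggl(\sum_{0 \le j < n} \frac{1}{10^{n-j}} \sum_{10^j \le k < 10^{j+1}} \frac{1}{10^j} P_{m-1}(k) + \sum_{10^n \le k \le 10^n s} \frac{1}{10^n} P_{m-1}(k)\Biggr);$$

and we want to approximate this quantity. By induction, the difference

$$\Biggl|\sum_{10^j \le k \le 10^j q} \frac{1}{10^j} P_{m-1}(k) - \sum_{10^j \le k \le 10^j q} \frac{1}{10^j} S_{m-1}\Bigl(\frac{k}{10^j}\Bigr)\Biggr| \qquad (13)$$

is less than $q\epsilon$ when $1 \le q \le 10$ and $j > N_{m-1}(\epsilon)$. Since $S_{m-1}(t)$ is continuous,
it is a Riemann-integrable function; and the difference

$$\Biggl|\sum_{10^j \le k \le 10^j q} \frac{1}{10^j} S_{m-1}\Bigl(\frac{k}{10^j}\Bigr) - \int_1^q S_{m-1}(t)\,dt\Biggr| \qquad (14)$$

is less than $\epsilon$ for all $j$ greater than some number $N$, independent of $q$, by the
definition of integration. We may choose $N$ to be $> N_{m-1}(\epsilon)$. Therefore for
$n > N$, the difference

$$\Biggl|P_m(10^n s) - \frac{1}{s}\Biggl(\sum_{0 \le j < n} \frac{1}{10^{n-j}} \int_1^{10} S_{m-1}(t)\,dt + \int_1^s S_{m-1}(t)\,dt\Biggr)\Biggr| \qquad (15)$$

is bounded by $\sum_{0 \le j \le N} (M/10^{n-j}) + \sum_{N < j < n} (11\epsilon/10^{n-j}) + 11\epsilon$, if $M$ is an upper
bound for (13) + (14) that is valid for all positive integers $j$. Finally, the sum
$\sum_{0 \le j < n} (1/10^{n-j})$, which appears in (15), is equal to $(1 - 1/10^n)/9$; so

$$\Biggl|P_m(10^n s) - \frac{1}{s}\Biggl(\frac{1}{9}\int_1^{10} S_{m-1}(t)\,dt + \int_1^s S_{m-1}(t)\,dt\Biggr)\Biggr|$$

can be made smaller than, say, $20\epsilon$, if $n$ is taken large enough. Comparing this
with (10) and (11) completes the proof. ∎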


The gist of Lemma Q is that we have the limiting relationship

$$\lim_{n \to \infty} P_m(10^n s) = S_m(s). \qquad (16)$$

Also, since $S_m(s)$ is not constant as $s$ varies, the limit

$$\lim_{n \to \infty} P_m(n)$$

(which would be our desired “probability”) does not exist for any $m$. The
situation is shown in Fig. 5, which shows the values of $S_m(s)$ when $m$ is small
and $r = 2$.

Fig. 5. The probability that the leading digit is 1.


Even though $S_m(s)$ is not a constant, so that we do not have a definite limit
for $P_m(n)$, notice that already for $m = 3$ in Fig. 5 the value of $S_m(s)$ stays very
close to $\log_{10} 2 \approx 0.30103$. Therefore we have good reason to suspect that $S_m(s)$
is very close to $\log_{10} r$ for all large $m$, and, in fact, that the sequence of functions
$\langle S_m(s)\rangle$ converges uniformly to the constant function $\log_{10} r$.

It is interesting to prove this conjecture by explicitly calculating $Q_m(s)$ and
$R_m(s)$ for all $m$, as in the proof of the following theorem:


Theorem F. Let $S_m(s)$ be the limit defined in (16). For all $\epsilon > 0$, there exists
a number $N(\epsilon)$ such that

$$|S_m(s) - \log_{10} r| < \epsilon, \qquad \text{for } 1 \le s \le 10, \qquad (17)$$

whenever $m > N(\epsilon)$.

Proof. In view of Lemma Q, we can prove this result if we can show that there
is a number $M$ depending on $\epsilon$ such that, for $1 \le s \le 10$ and for all $m > M$, we
have

$$|Q_m(s) - \log_{10} r| < \epsilon \quad \text{and} \quad |R_m(s)| < \epsilon. \qquad (18)$$

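It is not difficult to solve the recurrence formula (11) for $R_m$: We have
$R_0(s) = -1$, $R_1(s) = -1 + r/s$, $R_2(s) = -1 + (r/s)(1 + \ln(s/r))$, and in general

$$R_m(s) = -1 + \frac{r}{s}\Bigl(1 + \frac{1}{1!}\ln\frac{s}{r} + \cdots + \frac{1}{(m-1)!}\Bigl(\ln\frac{s}{r}\Bigr)^{m-1}\Bigr). \qquad (19)$$

For the stated range of $s$, this converges uniformly to $-1 + (r/s)\exp(\ln(s/r)) = 0$.
The recurrence (11) for $Q_m$ takes the form

$$Q_m(s) = \frac{1}{s}\Bigl(c_m + 1 + \int_1^s Q_{m-1}(t)\,dt\Bigr), \qquad (20)$$
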

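where

$$c_m = \frac{1}{9}\Bigl(\int_1^{10} Q_{m-1}(t)\,dt + \int_r^{10} R_{m-1}(t)\,dt\Bigr) - 1. \qquad (21)$$

And the solution to recurrence (20) is easily found by trying out the first few
cases and guessing at a formula that can be proved by induction; we find that

$$Q_m(s) = 1 + \frac{1}{s}\Bigl(c_m + \frac{1}{1!}\, c_{m-1} \ln s + \cdots + \frac{1}{(m-1)!}\, c_1 (\ln s)^{m-1}\Bigr). \qquad (22)$$

It remains for us to calculate the coefficients $c_m$, which by (19), (21), and
(22) satisfy the relations

$$\begin{aligned}
c_1 &= (r - 10)/9; \\
c_{m+1} &= \frac{1}{9}\Bigl(c_m \ln 10 + \frac{1}{2!}\, c_{m-1} (\ln 10)^2 + \cdots + \frac{1}{m!}\, c_1 (\ln 10)^m \\
&\qquad\quad {} + r\Bigl(1 + \frac{1}{1!}\ln\frac{10}{r} + \cdots + \frac{1}{m!}\Bigl(\ln\frac{10}{r}\Bigr)^m\Bigr) - 10\Bigr).
\end{aligned} \qquad (23)$$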


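This sequence appears at first to be very complicated, but actually we can
analyze it without difficulty with the help of generating functions. Let

$$C(z) = c_1 z + c_2 z^2 + c_3 z^3 + \cdots;$$

then since $10^z = 1 + z \ln 10 + (1/2!)(z \ln 10)^2 + \cdots$, we deduce that

$$\begin{aligned}
c_{m+1} &= \tfrac{1}{10}\, c_{m+1} + \tfrac{9}{10}\, c_{m+1} \\
&= \frac{1}{10}\Bigl(c_{m+1} + c_m \ln 10 + \cdots + \frac{1}{m!}\, c_1 (\ln 10)^m\Bigr) + \frac{r}{10}\Bigl(1 + \cdots + \frac{1}{m!}\Bigl(\ln\frac{10}{r}\Bigr)^m\Bigr) - 1
\end{aligned}$$

is the coefficient of $z^{m+1}$ in the function

$$\frac{1}{10}\, C(z)\, 10^z + \frac{r}{10}\Bigl(\frac{10}{r}\Bigr)^z \frac{z}{1 - z} - \frac{z}{1 - z}. \qquad (24)$$

This condition holds for all values of $m$, so (24) must equal $C(z)$, and we obtain
the explicit formula

$$C(z) = \frac{-z}{1 - z}\Bigl(\frac{(10/r)^{z-1} - 1}{10^{z-1} - 1}\Bigr). \qquad (25)$$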


We want to study asymptotic properties of the coefficients of $C(z)$, to complete
our analysis. The large parenthesized factor in (25) approaches
$\ln(10/r)/\ln 10 = 1 - \log_{10} r$ as $z \to 1$, so we see that

$$C(z) + \frac{1 - \log_{10} r}{1 - z} = R(z) \qquad (26)$$

is an analytic function of the complex variable $z$ in the circle

$$|z| < \left|1 + \frac{2\pi i}{\ln 10}\right|.$$

In particular, $R(z)$ converges for $z = 1$, so its coefficients approach zero. This
proves that the coefficients of $C(z)$ behave like those of $(\log_{10} r - 1)/(1 - z)$,
that is,

$$\lim_{m \to \infty} c_m = \log_{10} r - 1.$$

Finally, we may combine this with (22), to show that $Q_m(s)$ approaches

$$1 + \frac{\log_{10} r - 1}{s}\Bigl(1 + \ln s + \frac{1}{2!}(\ln s)^2 + \cdots\Bigr) = \log_{10} r$$

uniformly for $1 \le s \le 10$. ∎
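
The convergence asserted by Theorem F can also be watched numerically. The sketch below is only an illustration: it assumes the closed forms (19) and (22) and the recurrence (23) as stated in the text, works in ordinary floating point, and uses function names invented for this example. It generates $c_1, c_2, \ldots$ and evaluates $S_m(s) = Q_m(s) + R_m(s)[s > r]$.

```python
import math

def c_coefficients(r: float, count: int):
    """Return [c_1, ..., c_count], computed from recurrence (23)."""
    ln10 = math.log(10.0)
    lnq = math.log(10.0 / r)
    c = [(r - 10.0) / 9.0]                        # c_1
    for m in range(1, count):                     # this pass produces c_{m+1}
        from_c = sum(c[m - k] * ln10**k / math.factorial(k) for k in range(1, m + 1))
        from_r = r * sum(lnq**k / math.factorial(k) for k in range(m + 1))
        c.append((from_c + from_r - 10.0) / 9.0)
    return c

def s_m(m: int, s: float, r: float, c) -> float:
    """S_m(s) for m >= 1, using (22) for Q_m and (19) for R_m; c[i] holds c_{i+1}."""
    ln_s = math.log(s)
    q = 1 + sum(c[m - 1 - k] * ln_s**k / math.factorial(k) for k in range(m)) / s
    if s <= r:
        return q
    ln_sr = math.log(s / r)
    r_m = -1 + (r / s) * sum(ln_sr**k / math.factorial(k) for k in range(m))
    return q + r_m

r = 2.0
c = c_coefficients(r, 40)
grid = [1 + 9 * i / 90 for i in range(91)]        # s = 1.0, 1.1, ..., 10.0
for m in (1, 2, 3, 5, 10, 40):
    worst = max(abs(s_m(m, s, r, c) - math.log10(r)) for s in grid)
    print(m, f"{worst:.3e}")
```

With $r = 2$ the first few coefficients are $c_1 = -8/9$, $c_2 \approx -0.7587$, and $c_3 \approx -0.6993$, already close to the limit $\log_{10} 2 - 1 \approx -0.6990$.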


Therefore we have established the logarithmic law for integers by direct 
calculation, at the same time seeing that it is an extremely good approximation 
to the average behavior although it is never precisely achieved. 

The proofs of Lemma Q and Theorem F given above are slight simplifica- 
tions and amplifications of methods due to B. J. Flehinger, AMM 73 (1966), 
1056-1061. Many authors have written about the distribution of initial digits, 
showing that the logarithmic law is a good approximation for many underlying 
distributions; see the surveys by Ralph A. Raimi, AMM 83 (1976), 521-538, and 
Peter Schatte, J. Information Processing and Cybernetics 24 (1988), 443-455, 
for a comprehensive review of the literature. 

Exercise 17 discusses an approach to the definition of probability under 
which the logarithmic law holds exactly, over the integers. Furthermore, ex- 
ercise 18 demonstrates that any reasonable definition of probability over the 
integers must lead to the logarithmic law, if it assigns a value to the probability 
of leading digits. 

Floating point computations operate primarily on noninteger numbers, of 
course; we have studied integers because of their familiarity and their simplic- 
ity. When arbitrary real numbers are considered, theoretical results are more 
difficult to obtain, but evidence is accumulating that the same statistics apply, 
in the sense that repeated calculations with real numbers will nearly always 
tend to yield better and better approximations to a logarithmic distribution of 
fraction parts. For example, Peter Schatte [Zeitschrift für angewandte Math. 
und Mechanik 53 (1973), 553-565] showed that, under mild restrictions, the 
products of independent, identically distributed random real variables approach 
the logarithmic distribution. The sums of such variables do too, but only in the 
sense of repeated averaging. Similar results have been obtained by J. L. Barlow 
and E. H. Bareiss, Computing 34 (1985), 325-347. See also A. Berger, L. A. 
Bunimovich, and T. P. Hill, Trans. Amer. Math. Soc. 357 (2004), 197-219. 


EXERCISES 

1. [13] Given that u and v are nonzero floating decimal numbers with the same sign, 
what is the approximate probability that fraction overflow occurs during the calculation 
of $u \oplus v$, according to Tables 1 and 2?

2. [42] Make further tests of floating point addition and subtraction, to confirm or
improve on the accuracy of Tables 1 and 2.

3. [15] What is the probability that the two leading digits of a floating decimal
number are “23”, according to the logarithmic law?


4. [M18] The text points out that the front pages of a well-used table of logarithms 
get dirtier than the back pages do. What if we had an antilogarithm table instead, 
namely a table that tells us the value of $x$ when $\log_{10} x$ is given; which pages of such a
table would be the dirtiest? 


5. [M20] Let $U$ be a random real number that is uniformly distributed in the interval
$0 < U < 1$. What is the distribution of the leading digits of $U$?


6. [23] If we have binary computer words containing $n + 1$ bits, we might use $p$
bits for the fraction part of floating binary numbers, one bit for the sign, and $n - p$
bits for the exponent. This means that the range of values representable, namely the
ratio of the largest positive normalized value to the smallest, is essentially $2^{2^{n-p}}$. The
same computer word could be used to represent floating hexadecimal numbers, that is,
floating point numbers with radix 16, with $p + 2$ bits for the fraction part ($(p + 2)/4$
hexadecimal digits) and $n - p - 2$ bits for the exponent; then the range of values would
be $16^{2^{n-p-2}} = 2^{2^{n-p}}$, the same as before, and with more bits in the fraction part. This
may sound as if we are getting something for nothing, but the normalization condition
for base 16 is weaker in that there may be up to three leading zero bits in the fraction
part; thus not all of the $p + 2$ bits are “significant.”

On the basis of the logarithmic law, what are the probabilities that the fraction 
part of a positive normalized radix 16 floating point number has exactly 0, 1, 2, and 3 
leading zero bits? Discuss the desirability of hexadecimal versus binary. 


7. [HM28] Prove that there is no distribution function $F(u)$ that satisfies (5) for
each integer $b \ge 2$, and for all real values $r$ in the range $1 \le r \le b$.


8. [HM23] Does (10) hold when $m = 0$ for suitable $N_0(\epsilon)$?


9. [HM25] (P. Diaconis.) Let $P_1(n)$, $P_2(n)$, ... be any sequence of functions defined
by repeatedly averaging a given function $P_0(n)$ according to Eq. (9). Prove that
$\lim_{m\to\infty} P_m(n) = P_0(1)$ for all fixed $n$.


10. [HM28] The text shows that $c_m = \log_{10} r - 1 + \epsilon_m$, where $\epsilon_m$ approaches zero as
$m \to \infty$. Obtain the next term in the asymptotic expansion of $c_m$.


11. [M15] Given that U is a random variable distributed according to the logarithmic 
law, prove that 1/U is also. 


12. [HM25] (R. W. Hamming.) The purpose of this exercise is to show that the result
of floating point multiplication tends to obey the logarithmic law more perfectly than
the operands do. Let $U$ and $V$ be random, normalized, positive floating point numbers,
whose fraction parts are independently distributed with the respective density functions
$f(x)$ and $g(x)$. Thus, $f_U \le r$ and $f_V \le s$ with probability $\int_{1/b}^{r} \int_{1/b}^{s} f(x)\,g(y)\,dy\,dx$,
for $1/b \le r, s \le 1$. Let $h(x)$ be the density function of the fraction part of $U \times V$
(unrounded). Define the abnormality $A(f)$ of a density function $f$ to be the maximum
relative error,

$$A(f) = \max_{1/b \le x \le 1} \left|\frac{f(x) - l(x)}{l(x)}\right|,$$

where $l(x) = 1/(x \ln b)$ is the density of the logarithmic distribution.
Prove that $A(h) \le \min(A(f), A(g))$. (In particular, if either factor has logarithmic
distribution the product does also.)


▶ 13. [M20] The floating point multiplication routine, Algorithm 4.2.1M, requires zero
or one left shifts during normalization, depending on whether $f_u f_v \ge 1/b$ or not.
Assuming that the input operands are independently distributed according to the
logarithmic law, what is the probability that no left shift is needed for normalization
of the result?


▶ 14. [HM30] Let $U$ and $V$ be random, normalized, positive floating point numbers
whose fraction parts are independently distributed according to the logarithmic law,
and let $p_k$ be the probability that the difference in their exponents is $k$. Assuming that
the distribution of the exponents is independent of the fraction parts, give an equation
for the probability that “fraction overflow” occurs during the floating point addition of
$U \oplus V$, in terms of the base $b$ and the quantities $p_0$, $p_1$, $p_2$, .... Compare this result
with exercise 1. (Ignore rounding.)


15. [HM28] Let $U$, $V$, $p_0$, $p_1$, ... be as in exercise 14, and assume that radix 10
arithmetic is being used. Show that regardless of the values of $p_0$, $p_1$, $p_2$, ..., the sum
$U \oplus V$ will not obey the logarithmic law exactly, and in fact the probability that $U \oplus V$
has leading digit 1 is always strictly less than $\log_{10} 2$.


16. [HM28] (P. Diaconis.) Let $P_0(n)$ be 0 or 1 for each $n$, and define “probabilities”
$P_{m+1}(n)$ by repeated averaging, as in (9). Show that if $\lim_{n\to\infty} P_1(n)$ does not exist,
neither does $\lim_{n\to\infty} P_m(n)$ for any $m$. [Hint: Prove that $a_n \to 0$ whenever we have
$(a_1 + \cdots + a_n)/n \to 0$ and $a_{n+1} \le a_n + M/n$, for some fixed constant $M > 0$.]


▶ 17. [HM25] (M. Tsuji.) Another way to define the value of $\Pr(S(n))$ is to evaluate the
quantity $\lim_{n\to\infty} \bigl(H_n^{-1} \sum_{k=1}^{n} [S(k)]/k\bigr)$; it can be shown that this harmonic probability
exists and is equal to $\Pr(S(n))$, whenever the latter exists according to Definition 3.5A.
Prove that the harmonic probability of the statement “$(\log_{10} n) \bmod 1 < r$” exists and
equals $r$. (Thus, initial digits of integers satisfy the logarithmic law exactly in this
sense.)


▶ 18. [HM30] Let $P(S)$ be any real-valued function defined on sets $S$ of positive integers,
but not necessarily on all such sets, satisfying the following rather weak axioms:
i) If $P(S)$ and $P(T)$ are defined and $S \cap T = \emptyset$, then $P(S \cup T) = P(S) + P(T)$.
ii) If $P(S)$ is defined, then $P(S + 1) = P(S)$, where $S + 1 = \{n + 1 \mid n \in S\}$.
iii) If $P(S)$ is defined, then $P(2S) = \frac{1}{2} P(S)$, where $2S = \{2n \mid n \in S\}$.
iv) If $S$ is the set of all positive integers, then $P(S) = 1$.
v) If $P(S)$ is defined, then $P(S) \ge 0$.
Assume furthermore that $P(L_a)$ is defined for all positive integers $a$, where $L_a$ is the
set of all integers whose decimal representation begins with $a$:

$$L_a = \{n \mid 10^m a \le n < 10^m (a + 1) \text{ for some integer } m\}.$$

(In this definition, $m$ may be negative; for example, 1 is an element of $L_{10}$, but not
of $L_{11}$.) Prove that $P(L_a) = \log_{10}(1 + 1/a)$ for all integers $a \ge 1$.

19. [HM25] (R. L. Duncan.) Prove that the leading digits of Fibonacci numbers obey
the logarithmic law of fraction parts: $\Pr(10 f_{F_n} < r) = \log_{10} r$.

20. [HM40] Sharpen (16) by finding the asymptotic behavior of $P_m(10^n s) - S_m(s)$ as
$n \to \infty$.

4.3. MULTIPLE-PRECISION ARITHMETIC


LET US NOW consider operations on numbers that have arbitrarily high precision. 
For simplicity in exposition, we shall assume that we are working with integers, 
instead of with numbers that have an embedded radix point. 


4.3.1. The Classical Algorithms 
In this section we shall discuss algorithms for 


a) addition or subtraction of n-place integers, giving an n-place answer and a 
carry; 

b) multiplication of an m-place integer by an n-place integer, giving an (m+n)- 
place answer; 

c) division of an (m+n)-place integer by an n-place integer, giving an (m+ 1)- 
place quotient and an n-place remainder. 


These may be called the classical algorithms, since the word “algorithm” was 
used only in connection with these processes for several centuries. The term 
“n-place integer” means any nonnegative integer less than $b^n$, where $b$ is the
radix of ordinary positional notation in which the numbers are expressed; such 
numbers can be written using at most n “places” in this notation. 

It is a straightforward matter to apply the classical algorithms for integers 
to numbers with embedded radix points or to extended-precision floating point 
numbers, in the same way that arithmetic operations defined for integers in MIX 
are applied to these more general problems. 

In this section we shall study algorithms that do operations (a), (b), and (c) 
above for integers expressed in radix b notation, where b is any given integer 
that is 2 or more. Thus the algorithms are quite general definitions of arithmetic 
processes, and as such they are unrelated to any particular computer. But the 
discussion in this section will also be somewhat machine-oriented, since we are 
chiefly concerned with efficient methods for doing high-precision calculations by 
computer. Although our examples are based on the mythical MIX, essentially the 
same considerations apply to nearly every other machine. 

The most important fact to understand about extended-precision numbers 
is that they may be regarded as numbers written in radix w notation, where 
$w$ is the computer’s word size. For example, an integer that fills 10 words on a
computer whose word size is $w = 10^{10}$ has 100 decimal digits; but we will consider
it to be a 10-place number to the base $10^{10}$. This viewpoint is justified for the
same reason that we may convert, say, from binary to hexadecimal notation, 
simply by grouping the bits together. (See Eq. 4.1-(5).) 
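
The regrouping of digits can be made concrete with a short sketch. The Python functions below are hypothetical helpers written for this illustration, not part of the text; they split an integer into radix-$w$ places and reassemble it, using $w = 10^{10}$ as in the example above.

```python
def to_places(value: int, w: int):
    """Split a nonnegative integer into radix-w places, least significant place first."""
    places = []
    while True:
        value, place = divmod(value, w)
        places.append(place)
        if value == 0:
            return places

def from_places(places, w: int) -> int:
    """Inverse of to_places: evaluate (p_{n-1} ... p_1 p_0) in radix w."""
    total = 0
    for place in reversed(places):
        total = total * w + place
    return total

x = int("1234567890" * 10)            # an integer with 100 decimal digits
places = to_places(x, 10**10)         # the same integer as a 10-place number, base 10^10
print(len(str(x)), len(places))
```

Each place holds exactly ten consecutive decimal digits of the original number, which is the grouping that the text describes.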

In these terms, we are given the following primitive operations to work with:

$a_0$) addition or subtraction of one-place integers, giving a one-place answer and
a carry;

$b_0$) multiplication of a one-place integer by another one-place integer, giving a
two-place answer;

$c_0$) division of a two-place integer by a one-place integer, provided that the
quotient is a one-place integer, and yielding also a one-place remainder.

By adjusting the word size, if necessary, nearly all computers will have these three
operations available; so we will construct algorithms (a), (b), and (c) mentioned
above in terms of the primitive operations ($a_0$), ($b_0$), and ($c_0$).
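
To fix ideas, the three primitives might be modeled as follows. This is a sketch under stated assumptions: the function names and argument conventions are invented for this illustration, places are ordinary Python integers in the range 0 to $b - 1$, and a borrow is represented as $-1$.

```python
def add_place(x: int, y: int, carry_in: int, b: int):
    """Primitive (a0), addition: return (one-place sum, carry out), carry being 0 or 1."""
    total = x + y + carry_in
    return total % b, total // b

def sub_place(x: int, y: int, borrow_in: int, b: int):
    """Primitive (a0), subtraction: borrow_in is 0 or -1; return (one-place difference, borrow out)."""
    diff = x - y + borrow_in
    return diff % b, diff // b        # floor division yields a borrow of -1 when diff < 0

def mul_place(x: int, y: int, b: int):
    """Primitive (b0): return the two-place product as (high place, low place)."""
    product = x * y
    return product // b, product % b

def div_place(high: int, low: int, d: int, b: int):
    """Primitive (c0): divide (high low)_b by d; requires high < d so the quotient fits in one place."""
    if not 0 <= high < d:
        raise ValueError("quotient would not be a one-place integer")
    return divmod(high * b + low, d)  # (quotient, remainder)

print(add_place(7, 8, 1, 10), sub_place(3, 5, 0, 10), mul_place(9, 9, 10), div_place(3, 6, 4, 10))
```

The condition `high < d` in `div_place` is one way to express the proviso that the quotient be a one-place integer.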

Since we are visualizing extended-precision integers as base b numbers, it is 
sometimes helpful to think of the situation when b = 10, and to imagine that 
we are doing the arithmetic by hand. Then operation ($a_0$) is analogous to
memorizing the addition table; ($b_0$) is analogous to memorizing the multiplication
table; and ($c_0$) is essentially memorizing the multiplication table in reverse. The
more complicated operations (a), (b), (c) on high-precision numbers can now 
be done using the simple addition, subtraction, multiplication, and long-division 
procedures that children are taught in elementary school. In fact, most of the 
algorithms we shall discuss in this section are essentially nothing more than 
mechanizations of familiar pencil-and-paper operations. Of course, we must 
state the algorithms much more precisely than they have ever been stated in 
the fifth grade, and we should also attempt to minimize computer memory and 
running time requirements. 

To avoid a tedious discussion and cumbersome notations, we shall assume 
first that all the numbers we deal with are nonnegative. The additional work 
of computing the signs, etc., is quite straightforward, although some care is 
necessary when dealing with complemented numbers on computers that do not 
use a signed magnitude representation. Such issues are discussed near the end 
of this section. 

First comes addition, which of course is very simple, but it is worth careful 
study since the same ideas occur also in the other algorithms. 


Algorithm A (Addition of nonnegative integers). Given nonnegative n-place
integers $(u_{n-1} \ldots u_1 u_0)_b$ and $(v_{n-1} \ldots v_1 v_0)_b$, this algorithm forms their radix-b
sum, $(w_n w_{n-1} \ldots w_1 w_0)_b$. Here $w_n$ is the carry, and it will always be equal to
0 or 1.

A1. [Initialize.] Set $j \leftarrow 0$, $k \leftarrow 0$. (The variable $j$ will run through the various
digit positions, and the variable $k$ will keep track of carries at each step.)

A2. [Add digits.] Set $w_j \leftarrow (u_j + v_j + k) \bmod b$, and $k \leftarrow \lfloor (u_j + v_j + k)/b \rfloor$. (By
induction on the computation, we will always have

$$u_j + v_j + k \le (b - 1) + (b - 1) + 1 < 2b.$$

Thus $k$ is being set to 1 or 0, depending on whether a carry occurs or not;
equivalently, $k \leftarrow [u_j + v_j + k \ge b]$.)

A3. [Loop on j.] Increase $j$ by one. Now if $j < n$, go back to step A2; otherwise
set $w_n \leftarrow k$ and terminate the algorithm. ∎
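
For readers who prefer a high-level notation, Algorithm A can be transcribed almost step for step. The Python sketch below is an illustration, not the MIX implementation; it assumes that digits are stored in lists with the least significant place first, and the function name is hypothetical.

```python
def add_digits(u, v, b):
    """Algorithm A: add two n-place radix-b integers given as digit lists (least significant first).

    Returns a list of n + 1 digits; the last entry is the carry w_n (0 or 1)."""
    n = len(u)
    if len(v) != n:
        raise ValueError("both operands must have the same number of places")
    w = [0] * (n + 1)
    k = 0                              # A1. Initialize: the carry starts at 0.
    for j in range(n):                 # A3. Loop on j.
        total = u[j] + v[j] + k        # A2. Add digits; total < 2b by induction.
        w[j] = total % b
        k = total // b                 # k is 1 if a carry occurs, otherwise 0.
    w[n] = k                           # Store the final carry in w_n.
    return w

# 914 + 86 = 1000 in radix 10, digits listed from the units place upward.
print(add_digits([4, 1, 9], [6, 8, 0], 10))
```

The example propagates a carry through every position, so the carry digit $w_n$ ends up equal to 1.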

For a formal proof that Algorithm A is valid, see exercise 4. 
A MIX program for this addition process might take the following form: 


Program A (Addition of nonnegative integers). Let LOC($u_j$) = U + j, LOC($v_j$) =
V + j, LOC($w_j$) = W + j, rI1 = j − n, rA = k, word size = b, N = n.

01       ENN1  N          1          A1. Initialize. j ← 0.
02       JOV   OFLO       1          Ensure that overflow is off.
03  1H   ENTA  0          N+1-K      k ← 0.
04       J1Z   3F         N+1-K      Exit the loop if j = n.
05  2H   ADD   U+N,1      N          A2. Add digits.
06       ADD   V+N,1      N
07       STA   W+N,1      N
08       INC1  1          N          A3. Loop on j. j ← j + 1.
09       JNOV  1B         N          If no overflow, set k ← 0.
10       ENTA  1          K          Otherwise, set k ← 1.
11       J1N   2B         K          To A2 if j < n.
12  3H   STA   W+N        1          Store final carry in $w_n$. ∎


The running time for this program is 10N +6 cycles, independent of the number 
of carries, K. The quantity K is analyzed in detail at the close of this section. 
Many modifications of Algorithm A are possible, and only a few of these are 
mentioned in the exercises below. A chapter on generalizations of this algorithm 
might be entitled “How to design addition circuits for a digital computer.” 
The problem of subtraction is similar to addition, but the differences are 
worth noting: 


Algorithm S (Subtraction of nonnegative integers). Given nonnegative n-place
integers $(u_{n-1} \ldots u_1 u_0)_b \ge (v_{n-1} \ldots v_1 v_0)_b$, this algorithm forms their
nonnegative radix-b difference, $(w_{n-1} \ldots w_1 w_0)_b$.

S1. [Initialize.] Set $j \leftarrow 0$, $k \leftarrow 0$.

S2. [Subtract digits.] Set $w_j \leftarrow (u_j - v_j + k) \bmod b$, and $k \leftarrow \lfloor (u_j - v_j + k)/b \rfloor$.
(In other words, $k$ is set to $-1$ or 0, depending on whether a borrow occurs
or not, namely whether $u_j - v_j + k < 0$ or not. In the calculation of $w_j$, we
must have $-b = 0 - (b - 1) + (-1) \le u_j - v_j + k \le (b - 1) - 0 + 0 < b$;
hence $0 \le u_j - v_j + k + b < 2b$, and this suggests the method of computer
implementation explained below.)

S3. [Loop on j.] Increase $j$ by one. Now if $j < n$, go back to step S2; otherwise
terminate the algorithm. (When the algorithm terminates, we should have
$k = 0$; the condition $k = -1$ will occur if and only if $(v_{n-1} \ldots v_1 v_0)_b >
(u_{n-1} \ldots u_1 u_0)_b$, contrary to the given assumptions. See exercise 12.) ∎
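
Algorithm S admits an equally direct transcription. The Python sketch below is illustrative only (the function name and the least-significant-first list convention are assumptions made here); it relies on the fact that Python's `//` and `%` are floor division and the corresponding nonnegative remainder, matching the definitions of $\lfloor x \rfloor$ and mod used in step S2.

```python
def subtract_digits(u, v, b):
    """Algorithm S: compute u - v for n-place radix-b integers with u >= v.

    Digit lists are least significant first; raises ValueError if v > u."""
    n = len(u)
    if len(v) != n:
        raise ValueError("both operands must have the same number of places")
    w = [0] * n
    k = 0                              # S1. Initialize: no borrow yet.
    for j in range(n):                 # S3. Loop on j.
        diff = u[j] - v[j] + k         # S2. Subtract digits; -b <= diff < b.
        w[j] = diff % b
        k = diff // b                  # k is -1 if a borrow occurs, otherwise 0.
    if k != 0:
        raise ValueError("v > u, contrary to the given assumptions")
    return w

# 1000 - 86 = 914 in radix 10.
print(subtract_digits([0, 0, 0, 1], [6, 8, 0, 0], 10))
```

A final borrow of $-1$ is reported as an error, corresponding to the remark at the end of step S3.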


In a MIX program to implement subtraction, it is most convenient to retain 
the value 1 + k instead of k throughout the algorithm, so that we can calculate 
$u_j - v_j + (1 + k) + (b - 1)$ in step S2. (Recall that $b$ is the word size.) This is
illustrated in the following code. 


Program S (Subtraction of nonnegative integers). This program is analogous 
to the code in Program A, but with rA = 1 + k. Here, as in other programs of 
this section, location WM1 contains the constant b — 1, the largest possible value 
that can be stored in a MIX word; see Program 4.2.3D, lines 38-39. 


01       ENN1  N          1          S1. Initialize. j ← 0.
02       JOV   OFLO       1          Ensure that overflow is off.
03  1H   J1Z   DONE       K+1        Terminate if j = n.
04       ENTA  1          K          Set k ← 0.
05  2H   ADD   U+N,1      N          S2. Subtract digits.
06       SUB   V+N,1      N          Compute $u_j - v_j + k + b$.
07       ADD   WM1        N
08       STA   W+N,1      N          (May be minus zero)
09       INC1  1          N          S3. Loop on j. j ← j + 1.
10       JOV   1B         N          If overflow, set k ← 0.
11       ENTA  0          N-K        Otherwise set k ← −1.
12       J1N   2B         N-K        Back to S2 if j < n.
13       HLT   5                     (Error, v > u) ∎


The running time for this program is 12N + 3 cycles, slightly longer than the 
corresponding amount for Program A. 

The reader may wonder if it would not be worthwhile to have a combined 
addition-subtraction routine in place of the two algorithms A and S. But an 
examination of the code shows that it is generally better to use two different 
routines, so that the inner loops of the computations can be performed as rapidly 
as possible, since the programs are so short. 

Our next problem is multiplication, and here we carry the ideas used in 
Algorithm A a little further: 


Algorithm M (Multiplication of nonnegative integers). Given nonnegative
integers $(u_{m-1} \ldots u_1 u_0)_b$ and $(v_{n-1} \ldots v_1 v_0)_b$, this algorithm forms their radix-$b$
product $(w_{m+n-1} \ldots w_1 w_0)_b$. (The conventional pencil-and-paper method is
based on forming the partial products $(u_{m-1} \ldots u_1 u_0)_b \times v_j$ first, for $0 \le j < n$,
and then adding these products together with appropriate scale factors; but in
a computer it is simpler to do the addition concurrently with the multiplication,
as described in this algorithm.)

M1. [Initialize.] Set $w_{m-1}, w_{m-2}, \ldots, w_0$ all to zero. Set $j \leftarrow 0$. (If those
positions were not cleared to zero in this step, one can show that the steps
below would set

$$(w_{m+n-1} \ldots w_0)_b \leftarrow (u_{m-1} \ldots u_0)_b \times (v_{n-1} \ldots v_0)_b + (w_{m-1} \ldots w_0)_b.$$

This more general multiply-and-add operation is often useful.)

M2. [Zero multiplier?] If $v_j = 0$, set $w_{j+m} \leftarrow 0$ and go to step M6. (This test
might save time if there is a reasonable chance that $v_j$ is zero, but it may
be omitted without affecting the validity of the algorithm.)

M3. [Initialize i.] Set $i \leftarrow 0$, $k \leftarrow 0$.

M4. [Multiply and add.] Set $t \leftarrow u_i \times v_j + w_{i+j} + k$; then set $w_{i+j} \leftarrow t \bmod b$
and $k \leftarrow \lfloor t/b \rfloor$. (Here the carry $k$ will always be in the range $0 \le k < b$;
see below.)

M5. [Loop on i.] Increase $i$ by one. Now if $i < m$, go back to step M4; otherwise
set $w_{j+m} \leftarrow k$.

M6. [Loop on j.] Increase $j$ by one. Now if $j < n$, go back to step M2; otherwise
the algorithm terminates. ▮
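
The steps of Algorithm M translate almost line for line into a high-level language. The following Python sketch is only an illustration, not the MIX program given in the text: the function name `multiply` and the convention of storing digits in a list with the least significant digit first are assumptions made here for convenience. The assertion inside the loop checks the bounds on t and k that are discussed in the text.

```python
def multiply(u, v, b):
    """Radix-b product of two digit lists (least significant digit first).

    Hypothetical helper following steps M1-M6; u has m digits, v has n digits,
    and the result has m + n digits.
    """
    m, n = len(u), len(v)
    w = [0] * (m + n)                  # M1: clear the product area
    for j in range(n):                 # M6 controls this loop on j
        if v[j] == 0:                  # M2: optional zero-multiplier test
            w[j + m] = 0
            continue
        k = 0                          # M3: i <- 0, k <- 0
        for i in range(m):             # M4 and M5
            t = u[i] * v[j] + w[i + j] + k
            assert 0 <= t < b * b and 0 <= k < b
            w[i + j] = t % b
            k = t // b
        w[j + m] = k                   # end of M5: store the final carry
    return w


# 914 x 84 in radix 10, as in the worked table of the text
product = multiply([4, 1, 9], [4, 8], 10)
```

Reading `product` from its last element to its first gives the decimal digits of 76776.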
Table 1
MULTIPLICATION OF 914 BY 84

Step   i   j   u_i  v_j   t    w_4  w_3  w_2  w_1  w_0
M5     0   0    4    4    16    x    x    0    0    6
M5     1   0    1    4    05    x    x    0    5    6
M5     2   0    9    4    36    x    x    6    5    6
M6     3   0    x    4    36    x    3    6    5    6
M5     0   1    4    8    37    x    3    6    7    6
M5     1   1    1    8    17    x    3    7    7    6
M5     2   1    9    8    76    x    6    7    7    6
M6     3   1    x    8    76    7    6    7    7    6


Algorithm M is illustrated in Table 1, assuming that b = 10, by showing 
the states of the computation at the beginning of steps M5 and M6. A proof of 
Algorithm M appears in the answer to exercise 14. 

The two inequalities 


$$0 \le t < b^2, \qquad 0 \le k < b \tag{1}$$


are crucial for an efficient implementation of this algorithm, since they point out 
how large a register is needed for the computations. These inequalities may be 
proved by induction as the algorithm proceeds, for if we have k < b at the start 
of step M4, we have 


$$u_i \times v_j + w_{i+j} + k \le (b-1) \times (b-1) + (b-1) + (b-1) = b^2 - 1 < b^2.$$


The following MIX program shows the considerations that are necessary when 
Algorithm M is implemented on a computer. The coding for step M4 would be a 
little simpler if our computer had a “multiply-and-add” instruction, or if it had 
a double-length accumulator for addition. 


Program M (Multiplication of nonnegative integers). This program is analogous
to Program A. rI1 ≡ i − m, rI2 ≡ j − n, rI3 ≡ i + j, CONTENTS(CARRY) ≡ k.

01       ENT1  M-1      1        M1. Initialize.
02       JOV   OFLO     1        Ensure that overflow is off.
03       STZ   W,1      M        w_rI1 ← 0.
04       DEC1  1        M
05       J1NN  *-2      M        Repeat for m > rI1 ≥ 0.
06       ENN2  N        1        j ← 0.
07  1H   LDX   V+N,2    N        M2. Zero multiplier?
08       JXZ   8F       N        If v_j = 0, set w_{j+m} ← 0 and go to M6.
09       ENN1  M        N−Z      M3. Initialize i. i ← 0.
10       ENT3  N,2      N−Z      (i + j) ← j.
11       ENTX  0        N−Z      k ← 0.
12  2H   STX   CARRY    (N−Z)M   M4. Multiply and add.
13       LDA   U+M,1    (N−Z)M
14       MUL   V+N,2    (N−Z)M   rAX ← u_i × v_j.
15       SLC   5        (N−Z)M   Interchange rA ↔ rX.
16       ADD   W,3      (N−Z)M   Add w_{i+j} to lower half.
17       JNOV  *+2      (N−Z)M   Did overflow occur?
18       INCX  1        K        If so, carry 1 into upper half.
19       ADD   CARRY    (N−Z)M   Add k to lower half.
20       JNOV  *+2      (N−Z)M   Did overflow occur?
21       INCX  1        K′       If so, carry 1 into upper half.
22       STA   W,3      (N−Z)M   w_{i+j} ← t mod b.
23       INC1  1        (N−Z)M   M5. Loop on i. i ← i + 1.
24       INC3  1        (N−Z)M   (i + j) ← (i + j) + 1.
25       J1N   2B       (N−Z)M   Back to M4 with rX = ⌊t/b⌋ if i < m.
26  8H   STX   W+M+N,2  N        Set w_{j+m} ← k.
27       INC2  1        N        M6. Loop on j. j ← j + 1.
28       J2N   1B       N        Repeat until j = n.  ▮


The execution time of Program M depends on the number of places, M, in
the multiplicand u; the number of places, N, in the multiplier v; the number
of zeros, Z, in the multiplier; and the number of carries, K and K′, that occur
during the addition to the lower half of the product in the computation of t. If we
approximate both K and K′ by the reasonable (although somewhat pessimistic)
values ½(N − Z)M, we find that the total running time comes to 28MN + 4M +
10N + 3 − Z(28M + 3) cycles. If step M2 were deleted, the running time would
be 28MN + 4M + 7N + 3 cycles, so that step is advantageous only if the density
of zero positions within the multiplier is Z/N > 3/(28M + 3). If the multiplier
is chosen completely at random, the ratio Z/N is expected to be only about 1/b,
which is extremely small. We conclude that step M2 is usually not worthwhile,
unless b is small.
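
The trade-off for step M2 is simple arithmetic: the test costs three extra cycles for each of the N multiplier digits (the difference between the 10N and 7N terms), and it saves 28M + 3 cycles for each of the Z digits that turn out to be zero. A small Python sketch of that comparison follows; the function names are hypothetical, and the second function assumes uniformly random digits, so that each digit is zero with probability 1/b.

```python
def step_m2_pays_off(M, N, Z):
    """True if the zero-multiplier test saves more cycles than it costs."""
    extra_cost = 3 * N                 # test executed once per multiplier digit
    saving = Z * (28 * M + 3)          # inner loop skipped for each zero digit
    return saving > extra_cost


def largest_radix_where_m2_pays_off(M):
    """Largest b with 1/b > 3/(28M + 3), i.e. with 3b < 28M + 3."""
    return (28 * M + 2) // 3
```

For a word-sized radix the expected density 1/b of zero digits is far below the threshold 3/(28M + 3) unless M is enormous, in agreement with the conclusion above.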

Algorithm M is not the fastest way to multiply when m and n are large, 
although it has the advantage of simplicity. Speedier but more complicated 
methods are discussed in Section 4.3.3; it is possible to multiply numbers faster 
than Algorithm M even when m = n = 4. 

The final algorithm of concern to us in this section is long division, in which 
we want to divide (m + n)-place integers by n-place integers. Here the ordinary 
pencil-and-paper method involves a certain amount of guesswork and ingenuity 
on the part of the person doing the division; we must either eliminate this guess- 
work from the algorithm or develop some theory to explain it more carefully. 

A moment’s reflection about the ordinary process of long division shows that 
the general problem breaks down into simpler steps, each of which is the division 
of an (n + 1)-place dividend u by the n-place divisor v, where 0 ≤ u/v < b;
the remainder r after each step is less than v, so we may use the quantity 
rb + (next place of dividend) as the new u in the succeeding step. For example, 
if we are asked to divide 3142 by 53, we first divide 314 by 53, getting 5 and 
a remainder of 49; then we divide 492 by 53, getting 9 and a remainder of 15; 
thus we have a quotient of 59 and a remainder of 15. It is clear that this same 
idea works in general, and so our search for an appropriate division algorithm 
reduces to the following problem (Fig. 6): 


Let $u = (u_n u_{n-1} \ldots u_1 u_0)_b$ and $v = (v_{n-1} \ldots v_1 v_0)_b$ be nonnegative integers in
radix-$b$ notation, where $u/v < b$. Find an algorithm to determine $q = \lfloor u/v \rfloor$.

Fig. 6. Wanted: a way to determine q rapidly. [Figure: long-division layout with the
divisor $v_{n-1} \ldots v_1 v_0$ to the left of the dividend $u_n u_{n-1} \ldots u_1 u_0$, the quotient
digit $q$ above, and the product $qv$ subtracted below.]
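
The reduction of long division to a sequence of single-digit quotient problems can be sketched as follows. This Python fragment is only an illustration: it uses the built-in integer division to obtain each quotient digit, which is exactly the operation that the rest of this section learns to perform digit by digit. The function name and the most-significant-digit-first input convention are assumptions made here.

```python
def long_division_by_steps(dividend_digits, v, b=10):
    """Divide a number, given by its digits (most significant first), by v.

    Each step divides u = r*b + (next place of dividend) by v, where the
    previous remainder r is less than v, so the quotient of the step is a
    single radix-b digit.
    """
    r = 0
    quotient_digits = []
    for digit in dividend_digits:
        u = r * b + digit
        q = u // v                     # the digit that must be determined
        assert 0 <= q < b              # follows from r < v
        r = u - q * v
        quotient_digits.append(q)
    return quotient_digits, r


digits, remainder = long_division_by_steps([3, 1, 4, 2], 53)
```

For 3142 divided by 53 the nonzero steps are 314 / 53 = 5 with remainder 49 and 492 / 53 = 9 with remainder 15, matching the example in the text.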


We may observe that the condition $u/v < b$ is equivalent to the condition that
$u/b < v$, which is the same as $\lfloor u/b \rfloor < v$. This is simply the condition that
$(u_n u_{n-1} \ldots u_1)_b < (v_{n-1} v_{n-2} \ldots v_0)_b$. Furthermore, if we write $r = u - qv$, then
$q$ is the unique integer such that $0 \le r < v$.

The most obvious approach to this problem is to make a guess about q, 
based on the most significant digits of u and v. It isn’t obvious that such a 
method will be reliable enough, but it is worth investigating; let us therefore set 


$$\hat q = \min\left( \left\lfloor \frac{u_n b + u_{n-1}}{v_{n-1}} \right\rfloor,\; b - 1 \right). \tag{2}$$


This formula says that $\hat q$ is obtained by dividing the two leading digits of $u$ by
the leading digit of $v$; and if the result is $b$ or more we can replace it by $(b - 1)$.

It is a remarkable fact, which we will now investigate, that this value $\hat q$ is
always a very good approximation to the desired answer $q$, so long as $v_{n-1}$ is
reasonably large. In order to analyze how close $\hat q$ comes to $q$, we will first prove
that $\hat q$ is never too small.


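Theorem A. In the notation above, $\hat q \ge q$.


Proof. Since $q \le b - 1$, the theorem is certainly true if $\hat q = b - 1$. Otherwise we
have $\hat q = \lfloor (u_n b + u_{n-1})/v_{n-1} \rfloor$, hence $\hat q v_{n-1} \ge u_n b + u_{n-1} - v_{n-1} + 1$. It follows
that

$$\begin{aligned}
u - \hat q v &\le u - \hat q v_{n-1} b^{n-1} \\
&\le u_n b^n + \cdots + u_0 - (u_n b^n + u_{n-1} b^{n-1} - v_{n-1} b^{n-1} + b^{n-1}) \\
&= u_{n-2} b^{n-2} + \cdots + u_0 - b^{n-1} + v_{n-1} b^{n-1} < v_{n-1} b^{n-1} \le v.
\end{aligned}$$

Since $u - \hat q v < v$, we must have $\hat q \ge q$. ▮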


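We will now prove that $\hat q$ cannot be much larger than $q$ in practical situations.
Assume that $\hat q \ge q + 3$. We have

$$\hat q \le \frac{u_n b + u_{n-1}}{v_{n-1}} = \frac{u_n b^n + u_{n-1} b^{n-1}}{v_{n-1} b^{n-1}} \le \frac{u}{v_{n-1} b^{n-1}} < \frac{u}{v - b^{n-1}}.$$

(The case $v = b^{n-1}$ is impossible, for if $v = (100 \ldots 0)_b$ then $q = \hat q$.) Furthermore,
the relation $q > (u/v) - 1$ implies that

$$3 \le \hat q - q < \frac{u}{v - b^{n-1}} - \frac{u}{v} + 1 = \frac{u}{v} \left( \frac{b^{n-1}}{v - b^{n-1}} \right) + 1.$$

Therefore

$$\frac{u}{v} > 2 \left( \frac{v - b^{n-1}}{b^{n-1}} \right) \ge 2(v_{n-1} - 1).$$

Finally, since $b - 4 \ge \hat q - 3 \ge q = \lfloor u/v \rfloor \ge 2(v_{n-1} - 1)$, we have $v_{n-1} < \lfloor b/2 \rfloor$.
This proves the result we seek:


Theorem B. If $v_{n-1} \ge \lfloor b/2 \rfloor$, then $\hat q - 2 \le q \le \hat q$. ▮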


The most important part of this theorem is that the conclusion is independent
of $b$; no matter how large the radix is, the trial quotient $\hat q$ will never be
more than 2 in error.

The condition that $v_{n-1} \ge \lfloor b/2 \rfloor$ is very much like a normalization requirement;
in fact, it is exactly the condition of floating-binary normalization in a
binary computer. One simple way to ensure that $v_{n-1}$ is sufficiently large is to
multiply both $u$ and $v$ by $\lfloor b/(v_{n-1} + 1) \rfloor$; this does not change the value of $u/v$,
nor does it increase the number of places in $v$, and exercise 23 proves that it will
always make the new value of $v_{n-1}$ large enough. (Another way to normalize
the divisor is discussed in exercise 28.)
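
For small radices, Theorems A and B can be confirmed by exhaustive search. The Python sketch below is only a sanity check under stated assumptions, not a substitute for the proofs: the helper names are hypothetical, and u and v are held as ordinary integers from which the leading digits are extracted.

```python
def trial_quotient(u, v, n, b):
    """The estimate of Eq. (2): two leading digits of u over leading digit of v."""
    u_n = u // b**n                    # leading digit of the (n+1)-place u
    u_n1 = (u // b**(n - 1)) % b       # next digit of u
    v_n1 = v // b**(n - 1)             # leading digit of the n-place v
    return min((u_n * b + u_n1) // v_n1, b - 1)


def check_theorems(b, n):
    """Try every n-place v with nonzero leading digit and every u with u/v < b."""
    for v in range(b**(n - 1), b**n):
        normalized = v // b**(n - 1) >= b // 2
        for u in range(v * b):
            q = u // v
            qhat = trial_quotient(u, v, n, b)
            assert qhat >= q                       # Theorem A
            if normalized:
                assert qhat - 2 <= q               # Theorem B
    return True


ok = check_theorems(10, 2) and check_theorems(7, 2) and check_theorems(4, 3)
```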

Now that we have armed ourselves with all of these facts, we are in a
position to write the desired long-division algorithm. This algorithm uses a
slightly improved choice of $\hat q$ in step D3, which guarantees that $q = \hat q$ or $\hat q - 1$;
in fact, the improved choice of $\hat q$ made here is almost always accurate.


Algorithm D (Division of nonnegative integers). Given nonnegative integers
$u = (u_{m+n-1} \ldots u_1 u_0)_b$ and $v = (v_{n-1} \ldots v_1 v_0)_b$, where $v_{n-1} \ne 0$ and $n > 1$, we
form the radix-$b$ quotient $\lfloor u/v \rfloor = (q_m q_{m-1} \ldots q_0)_b$ and the remainder $u \bmod v =
(r_{n-1} \ldots r_1 r_0)_b$. (When $n = 1$, the simpler algorithm of exercise 16 should
be used.)

D1. [Normalize.] Set $d \leftarrow \lfloor b/(v_{n-1} + 1) \rfloor$. Then set $(u_{m+n} u_{m+n-1} \ldots u_1 u_0)_b$
equal to $(u_{m+n-1} \ldots u_1 u_0)_b$ times $d$; similarly, set $(v_{n-1} \ldots v_1 v_0)_b$ equal to
$(v_{n-1} \ldots v_1 v_0)_b$ times $d$. (Notice the introduction of a new digit position
$u_{m+n}$ at the left of $u_{m+n-1}$; if $d = 1$, all we need to do in this step is to set
$u_{m+n} \leftarrow 0$. On a binary computer it may be preferable to choose $d$ to be
a power of 2 instead of using the value suggested here; any value of $d$ that
results in $v_{n-1} \ge \lfloor b/2 \rfloor$ will suffice. See also exercise 37.)

D2. [Initialize j.] Set $j \leftarrow m$. (The loop on $j$, steps D2 through D7, will be
essentially a division of $(u_{j+n} \ldots u_{j+1} u_j)_b$ by $(v_{n-1} \ldots v_1 v_0)_b$ to get a single
quotient digit $q_j$; see Fig. 6.)

D3. [Calculate $\hat q$.] Set $\hat q \leftarrow \lfloor (u_{j+n} b + u_{j+n-1})/v_{n-1} \rfloor$ and let $\hat r$ be the remainder,
$(u_{j+n} b + u_{j+n-1}) \bmod v_{n-1}$. Now test if $\hat q \ge b$ or $\hat q v_{n-2} > b \hat r + u_{j+n-2}$; if
so, decrease $\hat q$ by 1, increase $\hat r$ by $v_{n-1}$, and repeat this test if $\hat r < b$. (The
test on $v_{n-2}$ determines at high speed most of the cases in which the trial
value $\hat q$ is one too large, and it eliminates all cases where $\hat q$ is two too large;
see exercises 19, 20, 21.)

D4. [Multiply and subtract.] Replace $(u_{j+n} u_{j+n-1} \ldots u_j)_b$ by

$$(u_{j+n} u_{j+n-1} \ldots u_j)_b - \hat q \, (0 v_{n-1} \ldots v_1 v_0)_b.$$

This computation (analogous to steps M3, M4, and M5 of Algorithm M)
consists of a simple multiplication by a one-place number, combined with
a subtraction. The digits $(u_{j+n}, u_{j+n-1}, \ldots, u_j)$ should be kept positive; if
the result of this step is actually negative, $(u_{j+n} u_{j+n-1} \ldots u_j)_b$ should be
left as the true value plus $b^{n+1}$, namely as the $b$'s complement of the true
value, and a "borrow" to the left should be remembered.

D5. [Test remainder.] Set $q_j \leftarrow \hat q$. If the result of step D4 was negative, go to
step D6; otherwise go on to step D7.

D6. [Add back.] (The probability that this step is necessary is very small, on
the order of only $2/b$, as shown in exercise 21; test data to activate this step
should therefore be specifically contrived when debugging. See exercise 22.)
Decrease $q_j$ by 1, and add $(0 v_{n-1} \ldots v_1 v_0)_b$ to $(u_{j+n} u_{j+n-1} \ldots u_{j+1} u_j)_b$. (A
carry will occur to the left of $u_{j+n}$, and it should be ignored since it cancels
with the borrow that occurred in D4.)

D7. [Loop on j.] Decrease $j$ by one. Now if $j \ge 0$, go back to D3.

D8. [Unnormalize.] Now $(q_m \ldots q_1 q_0)_b$ is the desired quotient, and the desired
remainder may be obtained by dividing $(u_{n-1} \ldots u_1 u_0)_b$ by $d$. ▮

Fig. 7. Long division. [Flowchart: D1. Normalize; D2. Initialize j; D3. Calculate $\hat q$;
D4. Multiply and subtract; D5. Test remainder; D6. Add back; D7. Loop on j;
D8. Unnormalize.]
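
A high-level rendering of Algorithm D may help in following the MIX program. The Python sketch below is an illustration under stated assumptions rather than a transcription of any program in the text: digits are kept in lists with the least significant digit first, and the names `divide`, `mul_small`, and `div_small` are hypothetical helpers introduced here.

```python
def mul_small(digits, d, b):
    """Multiply a digit list by a one-place number d, adding one digit on the left."""
    out, carry = [], 0
    for x in digits:
        carry, low = divmod(x * d + carry, b)
        out.append(low)
    out.append(carry)
    return out


def div_small(digits, d, b):
    """Divide a digit list by a one-place number d (the division is exact in D8)."""
    out, r = [0] * len(digits), 0
    for i in reversed(range(len(digits))):
        out[i], r = divmod(r * b + digits[i], d)
    return out


def divide(u, v, b):
    """Return (quotient digits q_0..q_m, remainder digits r_0..r_{n-1})."""
    n, m = len(v), len(u) - len(v)
    assert n > 1 and m >= 0 and v[-1] != 0
    d = b // (v[-1] + 1)                           # D1: normalize
    u = mul_small(u, d, b)                         # now has m + n + 1 digits
    v = mul_small(v, d, b)
    top = v.pop()                                  # v gains no new digit
    assert top == 0
    q = [0] * (m + 1)
    for j in range(m, -1, -1):                     # D2 and D7
        qhat, rhat = divmod(u[j + n] * b + u[j + n - 1], v[n - 1])   # D3
        while qhat >= b or qhat * v[n - 2] > b * rhat + u[j + n - 2]:
            qhat -= 1
            rhat += v[n - 1]
            if rhat >= b:
                break
        borrow = 0                                 # D4: multiply and subtract
        for i in range(n + 1):
            t = u[i + j] - qhat * (v[i] if i < n else 0) - borrow
            u[i + j] = t % b                       # digit kept nonnegative
            borrow = -(t // b)
        q[j] = qhat                                # D5
        if borrow:                                 # D6: add back (rare)
            q[j] -= 1
            carry = 0
            for i in range(n + 1):
                t = u[i + j] + (v[i] if i < n else 0) + carry
                u[i + j] = t % b
                carry = t // b                     # final carry is ignored
    return q, div_small(u[:n], d, b)               # D8: unnormalize


quotient, remainder = divide([2, 4, 1, 3], [3, 5], 10)   # 3142 divided by 53
```

Because step D6 is needed only rarely when b is large, a small radix such as 2 or 3 is a convenient way to exercise the add-back path when testing a sketch like this one.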


The representation of Algorithm D as a MIX program has several points of
interest:


Program D (Division of nonnegative integers). The conventions of this program
are analogous to Program A; rI1 ≡ i − n, rI2 ≡ j, rI3 ≡ i + j.

001  D1  JOV   OFLO        1            D1. Normalize.
···                                     (See exercise 25)
039  D2  ENT2  M           1            D2. Initialize j. j ← m.
040      STZ   V+N         1            Set v_n ← 0, for convenience in D4.
041  D3  LDA   U+N,2(1:5)  M+1          D3. Calculate q̂.
042      LDX   U+N-1,2     M+1          rAX ← u_{j+n} b + u_{j+n−1}.
043      DIV   V+N-1       M+1          rA ← ⌊rAX/v_{n−1}⌋.
044      JOV   1F          M+1          Jump if quotient ≥ b.
045      STA   QHAT        M+1          q̂ ← rA.
046      STX   RHAT        M+1          r̂ ← u_{j+n} b + u_{j+n−1} − q̂ v_{n−1}
047      JMP   2F          M+1            = (u_{j+n} b + u_{j+n−1}) mod v_{n−1}.
048  1H  LDX   WM1                      rX ← b − 1.
049      LDA   U+N-1,2                  rA ← u_{j+n−1}. (Here u_{j+n} = v_{n−1}.)
050      JMP   4F
051  3H  LDX   QHAT        E
052      DECX  1           E            Decrease q̂ by one.
053      LDA   RHAT        E            Adjust r̂ accordingly:
054  4H  STX   QHAT        E            q̂ ← rX.
055      ADD   V+N-1       E            rA ← r̂ + v_{n−1}.
056      JOV   D4          E            (If r̂ will be ≥ b, q̂ v_{n−2} will be < r̂ b.)
057      STA   RHAT        E            r̂ ← rA.
058      LDA   QHAT        E
059  2H  MUL   V+N-2       M+E+1
060      CMPA  RHAT        M+E+1        Test if q̂ v_{n−2} ≤ r̂ b + u_{j+n−2}.
061      JL    D4          M+E+1
062      JG    3B          E
063      CMPX  U+N-2,2
064      JG    3B                       If not, q̂ is too large.
065  D4  ENTX  1           M+1          D4. Multiply and subtract.
066      ENN1  N           M+1          i ← 0.
067      ENT3  0,2         M+1          (i + j) ← j.
068  2H  STX   CARRY       (M+1)(N+1)   (Here 1 − b < rX ≤ +1.)
069      LDAN  V+N,1       (M+1)(N+1)
070      MUL   QHAT        (M+1)(N+1)   rAX ← −q̂ v_i.
071      SLC   5           (M+1)(N+1)   Interchange rA ↔ rX.
072      ADD   CARRY       (M+1)(N+1)   Add the contribution from the
073      JNOV  *+2         (M+1)(N+1)     digit to the right, plus 1.
074      DECX  1           K            If sum is ≤ −b, carry −1.
075      ADD   U,3         (M+1)(N+1)   Add u_{i+j}.
076      ADD   WM1         (M+1)(N+1)   Add b − 1 to force + sign.
077      JNOV  *+2         (M+1)(N+1)   If no overflow, carry −1.
078      INCX  1           K′           rX ≡ carry + 1.
079      STA   U,3         (M+1)(N+1)   u_{i+j} ← rA (may be minus zero).
080      INC1  1           (M+1)(N+1)
081      INC3  1           (M+1)(N+1)
082      J1NP  2B          (M+1)(N+1)   Repeat for 0 ≤ i ≤ n.
083  D5  LDA   QHAT        M+1          D5. Test remainder.
084      STA   Q,2         M+1          Set q_j ← q̂.
085      JXP   D7          M+1          (Here rX = 0 or 1, since v_n = 0.)
086  D6  DECA  1                        D6. Add back.
087      STA   Q,2                      Set q_j ← q̂ − 1.
088      ENN1  N                        i ← 0.
089      ENT3  0,2                      (i + j) ← j.
090  1H  ENTA  0                        (This is essentially Program A.)
091  2H  ADD   U,3
092      ADD   V+N,1
093      STA   U,3
094      INC1  1
095      INC3  1
096      JNOV  1B
097      ENTA  1
098      J1NP  2B
099  D7  DEC2  1           M+1          D7. Loop on j.
100      J2NN  D3          M+1          Repeat for m ≥ j ≥ 0.
101  D8  ···                            (See exercise 26)  ▮


Note how easily the rather complex-appearing calculations and decisions of 
step D3 can be handled inside the machine. Notice also that the program for 
step D4 is analogous to Program M, except that the ideas of Program S have 
also been incorporated. 

The running time for Program D can be estimated by considering the quantities
M, N, E, K, and K′ shown in the program. (These quantities ignore several
situations that occur only with very low probability; for example, we may assume
that lines 048-050, 063-064, and step D6 are never executed.) Here M + 1 is
the number of words in the quotient; N is the number of words in the divisor;
E is the number of times q̂ is adjusted downwards in step D3; K and K′ are
the number of times certain carry adjustments are made during the
multiply-subtract loop. If we assume that K + K′ is approximately (N + 1)(M + 1),
and that E is approximately ½M, we get a total running time of approximately
30MN + 30N + 89M + 111 cycles, plus 67N + 23.5M + 4 more if d > 1. (The
program segments of exercises 25 and 26 are included in these totals.) When M
and N are large, this is only about seven percent longer than the time needed
by Program M to multiply the quotient by the divisor.

When the radix $b$ is comparatively small, so that $b^2$ is less than the computer's
word size, multiprecision division can be speeded up by not reducing
individual digits of intermediate results to the range $[0\,..\,b)$; see D. M. Smith,
Math. Comp. 65 (1996), 157-163. Further commentary on Algorithm D appears 
in the exercises at the close of this section. 

It is possible to debug programs for multiple-precision arithmetic by using 
the multiplication and addition routines to check the result of the division 
routine, etc. The following type of test data is occasionally useful: 


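$$(t^m - 1)(t^n - 1) = t^{m+n} - t^n - t^m + 1.$$

If $m < n$, this number has the radix-$t$ expansion

$$\underbrace{(t-1) \;\ldots\; (t-1)}_{m-1 \text{ places}} \;\; (t-2) \;\; \underbrace{(t-1) \;\ldots\; (t-1)}_{n-m \text{ places}} \;\; \underbrace{0 \;\ldots\; 0}_{m-1 \text{ places}} \;\; 1;$$

for example, $(10^3 - 1)(10^8 - 1) = 99899999001$. In the case of Program D, it is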
also necessary to find some test cases that cause the rarely executed parts of the 
program to be exercised; some portions of that program would probably never 
get tested even if a million random test cases were tried. (See exercise 22.) 


Now that we have seen how to operate with signed magnitude numbers, 
let us consider what approach should be taken to the same problems when a 
computer with complement notation is being used. For two's complement and
ones' complement notations, it is usually best to let the radix $b$ be one half of the
word size; thus for a 32-bit computer word we would use $b = 2^{31}$ in the algorithms
above. The sign bit of all but the most significant word of a multiple-precision
number will be zero, so that no anomalous sign correction takes place during the
computer's multiplication and division operations. In fact, the basic meaning of
complement notation requires that we consider all but the most significant word
to be nonnegative. For example, assuming an 8-bit word, the two's complement
number

    11011111 1111110 1101011

(where the sign bit is shown only in the most significant word) is properly thought
of as

$$-2^{21} + (1011111)_2 \cdot 2^{14} + (1111110)_2 \cdot 2^{7} + (1101011)_2.$$
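
The interpretation of such a multiword number can be made concrete with a few lines of Python. This is a sketch under the conventions of the example only: words are listed with the most significant first, the top word is a full two's complement word, and every other word holds one bit fewer and is treated as nonnegative. The function name is hypothetical.

```python
def multiword_value(words, bits=8):
    """Value of a two's complement multiword number with radix 2**(bits - 1)."""
    b = 1 << (bits - 1)                # radix is one half of the word size
    top = words[0]
    value = top - (1 << bits) if top & b else top   # signed value of top word
    for w in words[1:]:
        assert 0 <= w < b              # lower words carry no sign bit
        value = value * b + w
    return value


example = multiword_value([0b11011111, 0b1111110, 0b1101011])
expanded = -2**21 + 0b1011111 * 2**14 + 0b1111110 * 2**7 + 0b1101011
```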

On the other hand, some binary computers that work with two’s complement 
notation also provide true unsigned arithmetic as well. For example, let x 
and y be 32-bit operands. A computer might regard them as two’s complement 
numbers in the range $-2^{31} \le x, y < 2^{31}$, or as unsigned numbers in the range
$0 \le x, y < 2^{32}$. If we ignore overflow, the 32-bit sum $(x + y) \bmod 2^{32}$ is the same
under either interpretation; but overflow occurs in different circumstances when
we change the assumed range. If the computer allows easy computation of the
carry bit $\lfloor (x + y)/2^{32} \rfloor$ in the unsigned interpretation, and if it provides a full
64-bit product of unsigned 32-bit integers, we can use $b = 2^{32}$ instead of $b = 2^{31}$
in our high-precision algorithms. 
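
The difference between the two interpretations can be seen by modelling a 32-bit adder on bit patterns. The Python sketch below is an illustration, not a description of any particular machine; the name `add32` and its three-part result are assumptions made here.

```python
MASK = (1 << 32) - 1


def add32(x, y):
    """Add two 32-bit patterns.

    Returns (sum mod 2^32, unsigned carry bit, two's complement overflow flag).
    """
    total = x + y
    s = total & MASK                   # same bits under either interpretation
    carry = total >> 32                # floor((x + y) / 2^32), unsigned view
    sign_x, sign_y, sign_s = x >> 31, y >> 31, s >> 31
    signed_overflow = sign_x == sign_y and sign_s != sign_x
    return s, carry, signed_overflow
```

Adding 1 to the pattern 0xFFFFFFFF produces a carry but no signed overflow (it is −1 + 1 in two's complement), while adding 1 to 0x7FFFFFFF produces signed overflow but no carry.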

Addition of signed numbers is slightly easier when complement notations 
are being used, since the routine for adding n-place nonnegative integers can be 
used for arbitrary n-place integers; the sign appears only in the first word, so 
the less significant words may be added together irrespective of the actual sign. 
(Special attention must be given to the leftmost carry when ones’ complement 
notation is being used, however; it must be added into the least significant word, 
and possibly propagated further to the left.) Similarly, we find that subtraction 
of signed numbers is slightly simpler with complement notation. On the other 
hand, multiplication and division seem to be done most easily by working with 
nonnegative quantities and doing suitable complementation operations before- 
hand to make sure that both operands are nonnegative. It may be possible to 
avoid this complementation by devising some tricks for working directly with 
negative numbers in a complement notation, and it is not hard to see how this 
could be done in double-precision multiplication; but care should be taken not 
to slow down the inner loops of the subroutines when high precision is required. 


Let us now turn to an analysis of the quantity K that arises in Program A, 
namely the number of carries that occur when two n-place numbers are being 
added together. Although K has no effect on the total running time of Program A,
it does affect the running time of Program A's counterparts that
deal with complement notations, and its analysis is interesting in itself as a 
significant application of generating functions. 

Suppose that u and v are independent random n-place integers, uniformly 
distributed in the range $0 \le u, v < b^n$. Let $p_{nk}$ be the probability that exactly
$k$ carries occur in the addition of $u$ to $v$, and that one of these carries occurs
in the most significant position (so that $u + v \ge b^n$). Similarly, let $q_{nk}$ be the
probability that exactly $k$ carries occur, but that there is no carry in the most
significant position. Then it is not hard to see that, for all $k$ and $n$,

$$\begin{aligned}
p_{0k} &= 0, & p_{(n+1)(k+1)} &= \frac{b+1}{2b}\, p_{nk} + \frac{b-1}{2b}\, q_{nk}, \\
q_{0k} &= \delta_{0k}, & q_{(n+1)k} &= \frac{b-1}{2b}\, p_{nk} + \frac{b+1}{2b}\, q_{nk};
\end{aligned} \tag{3}$$

this happens because $(b - 1)/2b$ is the probability that $u_{n-1} + v_{n-1} \ge b$ and
$(b + 1)/2b$ is the probability that $u_{n-1} + v_{n-1} + 1 \ge b$, when $u_{n-1}$ and $v_{n-1}$ are
independently and uniformly distributed integers in the range $0 \le u_{n-1}, v_{n-1} < b$.

To obtain further information about these quantities $p_{nk}$ and $q_{nk}$, we set up
the generating functions

$$P(z,t) = \sum_{k,n} p_{nk} z^k t^n, \qquad Q(z,t) = \sum_{k,n} q_{nk} z^k t^n. \tag{4}$$

From (3) we have the basic relations

$$\begin{aligned}
P(z,t) &= zt \left( \frac{b+1}{2b} P(z,t) + \frac{b-1}{2b} Q(z,t) \right), \\
Q(z,t) &= 1 + t \left( \frac{b-1}{2b} P(z,t) + \frac{b+1}{2b} Q(z,t) \right).
\end{aligned}$$

These two equations are readily solved for $P(z,t)$ and $Q(z,t)$; and if we let

$$G(z,t) = P(z,t) + Q(z,t) = \sum_n G_n(z) t^n,$$

where $G_n(z)$ is the generating function for the total number of carries when
$n$-place numbers are added, we find that

$$G(z,t) = (b - zt)/p(z,t), \qquad \text{where } p(z,t) = b - \tfrac12 (1+b)(1+z)t + zt^2. \tag{5}$$

Note that $G(1,t) = 1/(1 - t)$, and this checks with the fact that $G_n(1)$ must
equal 1 (it is the sum of all the possible probabilities). Taking partial derivatives
of (5) with respect to $z$, we find that

$$\begin{aligned}
\frac{\partial G}{\partial z} &= \sum_n G_n'(z) t^n = \frac{-t}{p(z,t)} + \frac{t(b - zt)(b + 1 - 2t)}{2p(z,t)^2}; \\
\frac{\partial^2 G}{\partial z^2} &= \sum_n G_n''(z) t^n = \frac{-t^2(b + 1 - 2t)}{p(z,t)^2} + \frac{t^2(b - zt)(b + 1 - 2t)^2}{2p(z,t)^3}.
\end{aligned}$$

Now let us put $z = 1$ and expand in partial fractions:

$$\begin{aligned}
\sum_n G_n'(1) t^n &= \frac{t}{2} \left( \frac{1}{(1-t)^2} - \frac{1}{(b-1)(1-t)} + \frac{1}{(b-1)(b-t)} \right), \\
\sum_n G_n''(1) t^n &= \frac{t^2}{2} \left( \frac{1}{(1-t)^3} - \frac{1}{(b-1)^2(1-t)} + \frac{1}{(b-1)^2(b-t)} + \frac{1}{(b-1)(b-t)^2} \right).
\end{aligned}$$

It follows that the average number of carries, the mean value of $K$, is

$$G_n'(1) = \frac12 \left( n - \frac{1}{b-1} \left( 1 - \left( \frac1b \right)^{\!n} \right) \right); \tag{6}$$

the variance is

$$G_n''(1) + G_n'(1) - G_n'(1)^2 = \frac14 \left( n + \frac{2n}{b-1} - \frac{2b+1}{(b-1)^2} + \frac{2b+2}{(b-1)^2} \left( \frac1b \right)^{\!n} - \frac{1}{(b-1)^2} \left( \frac1b \right)^{\!2n} \right). \tag{7}$$


So the number of carries is just slightly less than $\frac12 n$ under these assumptions.
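
The mean and variance just stated can be compared with exact enumeration when b and n are small. The Python sketch below is a check under the same uniformity assumptions as the text; it uses exact rational arithmetic, and all function names are hypothetical.

```python
from fractions import Fraction


def count_carries(u, v, b, n):
    """Number of carries when the n-place radix-b integers u and v are added."""
    carries = carry = 0
    for _ in range(n):
        carry = 1 if (u % b) + (v % b) + carry >= b else 0
        carries += carry
        u //= b
        v //= b
    return carries


def exact_moments(b, n):
    """Mean and variance of the carry count over all b^n * b^n operand pairs."""
    size = b ** n
    counts = [count_carries(u, v, b, n) for u in range(size) for v in range(size)]
    mean = Fraction(sum(counts), size * size)
    second = Fraction(sum(k * k for k in counts), size * size)
    return mean, second - mean * mean


def mean_by_formula(b, n):
    """The closed form for the mean number of carries."""
    return Fraction(1, 2) * (n - Fraction(1, b - 1) * (1 - Fraction(1, b) ** n))


def variance_by_formula(b, n):
    """The closed form for the variance of the number of carries."""
    c = Fraction(1, (b - 1) ** 2)
    x = Fraction(1, b) ** n
    return Fraction(1, 4) * (n + Fraction(2 * n, b - 1) - (2 * b + 1) * c
                             + (2 * b + 2) * c * x - c * x * x)
```

For b = 2 and n = 2, for instance, both approaches give a mean of 5/8 and a variance of 39/64.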


History and bibliography. The early history of the classical algorithms 
described in this section is left as an interesting project for the reader, and 
only the history of their implementation on computers will be traced here. 

The use of $10^n$ as an assumed radix when multiplying large numbers on a
desk calculator was discussed by D. N. Lehmer and J. P. Ballantine, AMM 30 
(1923), 67-69. 

Double-precision arithmetic on digital computers was first treated by J. von 
Neumann and H. H. Goldstine in their introductory notes on programming, 
originally published in 1947 [J. von Neumann, Collected Works 5, 142-151].
Theorems A and B above are due to D. A. Pope and M. L. Stein [CACM 3
(1960), 652-654], whose paper also contains a bibliography of earlier work on
double-precision routines. Other ways of choosing the trial quotient $\hat q$ have been
discussed by A. G. Cox and H. A. Luther, CACM 4 (1961), 353 [divide by $v_{n-1} + 1$
instead of $v_{n-1}$], and by M. L. Stein, CACM 7 (1964), 472-474 [divide by $v_{n-1}$
or $v_{n-1} + 1$ according to the magnitude of $v_{n-2}$]; E. V. Krishnamurthy [CACM
8 (1965), 179-181] showed that examination of the single-precision remainder in
the latter method leads to an improvement over Theorem B. Krishnamurthy and
Nandi [CACM 10 (1967), 809-813] suggested a way to replace the normalization
and unnormalization operations of Algorithm D by a calculation of $\hat q$ based on
several leading digits of the operands. G. E. Collins and D. R. Musser have
carried out an interesting analysis of the original Pope and Stein algorithm
[Information Processing Letters 6 (1977), 151-155].

Several alternative approaches to division have also been suggested: 

1) “Fourier division” [J. Fourier, Analyse des Equations Déterminées (Paris: 
1831), §2.21]. This method, which was often used on desk calculators, essentially 
obtains each new quotient digit by increasing the precision of the divisor and the 
dividend at each step. Some rather extensive tests by the author have indicated 
that such a method is inferior to the divide-and-correct technique above, but 
there may be some applications in which Fourier division is practical. See D. H. 
Lehmer, AMM 33 (1926), 198-206; J. V. Uspensky, Theory of Equations (New 
York: McGraw-Hill, 1948), 159-164. 

2) “Newton’s method” for evaluating the reciprocal of a number was extensively 
used in early computers when there was no single-precision division instruction. 
The idea is to find some initial approximation x9 to the number 1/v, then to let 
$x_{n+1} = 2x_n - v x_n^2$. This method converges rapidly to $1/v$, since $x_n = (1 - \epsilon)/v$
implies that $x_{n+1} = (1 - \epsilon^2)/v$. Convergence to third order, with $\epsilon$ replaced by
$O(\epsilon^3)$ at each step, can be obtained using the formula

$$\begin{aligned}
x_{n+1} &= x_n + x_n(1 - v x_n) + x_n(1 - v x_n)^2 \\
&= x_n \bigl( 1 + (1 - v x_n)(1 + (1 - v x_n)) \bigr),
\end{aligned}$$

and similar formulas hold for fourth-order convergence, etc.; see P. Rabinowitz,
CACM 4 (1961), 98. For calculations on extremely large numbers, Newton's
second-order method and subsequent multiplication by $u$ can actually be considerably
faster than Algorithm D, if we increase the precision of $x_n$ at each step and
if we also use the fast multiplication routines of Section 4.3.3. (See Algorithm
4.3.3R for details.) Some related iterative schemes have been discussed by E. V.
Krishnamurthy, IEEE Trans. C-19 (1970), 227-231.


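3) Division methods have also been based on the evaluation of

$$\frac{u}{v + \epsilon} = \frac{u}{v} \left( 1 - \left( \frac{\epsilon}{v} \right) + \left( \frac{\epsilon}{v} \right)^{\!2} - \left( \frac{\epsilon}{v} \right)^{\!3} + \cdots \right).$$

See H. H. Laughlin, AMM 37 (1930), 287-293. We have used this idea in the
double-precision case (Eq. 4.2.3-(2)).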
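
The convergence claims for the Newton iterations of item 2 can be verified exactly with rational arithmetic. The Python sketch below is an illustration only: it works with exact fractions instead of the fixed-point approximations a real multiprecision routine would use, and the function names are hypothetical.

```python
from fractions import Fraction


def newton_step(x, v):
    """Second-order step x' = 2x - v x^2; the relative error is squared."""
    return 2 * x - v * x * x


def third_order_step(x, v):
    """Third-order step x' = x(1 + e(1 + e)) with e = 1 - v x; error is cubed."""
    e = 1 - v * x
    return x * (1 + e * (1 + e))


def relative_error(x, v):
    """The quantity epsilon defined by x = (1 - epsilon)/v."""
    return 1 - v * x


v = Fraction(7)
x = Fraction(1, 10)                    # crude first approximation to 1/7
errors = [relative_error(x, v)]
for _ in range(4):
    x = newton_step(x, v)
    errors.append(relative_error(x, v))
```

Starting from the relative error 3/10, the list `errors` contains 3/10, (3/10)^2, (3/10)^4, and so on, so the number of correct digits roughly doubles with each step.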


Besides the references just cited, the following early articles concerning 
multiple-precision arithmetic are also of interest: High-precision routines for 
floating point calculations using ones’ complement arithmetic were described 
by A. H. Stroud and D. Secrest, Comp. J. 6 (1963), 62-66. Extended-precision 
subroutines for use in FORTRAN programs were described by B. I. Blum, CACM 
8 (1965), 318-320, and for use in ALGOL by M. Tienari and V. Suokonautio, 
BIT 6 (1966), 332-338. Arithmetic on integers with unlimited precision, making 
use of linked memory allocation techniques, was elegantly introduced by G. E. 
Collins, CACM 9 (1966), 578-589. For a much larger repertoire of multiple- 
precision operations, including logarithms and trigonometric functions, see R. P. 
Brent, ACM Trans. Math. Software 4 (1978), 57-81; D. M. Smith, ACM Trans. 
Math. Software 17 (1991), 273-283, 24 (1998), 359-367. 

Human progress in calculation has traditionally been measured by the number
of decimal digits of π that were known at a given point in history. Section 4.1
mentions some of the early developments; by 1719, Thomas Fantet de Lagny
had computed π to 127 decimal places [Mémoires Acad. Sci. (Paris, 1719), 135-
145; a typographical error affected the 113th digit]. After better formulas were
discovered, a famous mental calculator from Hamburg named Zacharias Dase
needed less than two months to calculate 200 decimal digits correctly in 1844
[Crelle 27 (1844), 198]. Then William Shanks published 607 decimals of π in
1853, and continued to extend his calculations until he had obtained 707 digits
in 1873. [See W. Shanks, Contributions to Mathematics (London: 1853); Proc.
Royal Soc. London 21 (1873), 318-319; 22 (1873), 45-46; J. C. V. Hoffmann,
Zeit. für math. und naturwiss. Unterricht 26 (1895), 261-264.] Shanks's 707-
place value was widely quoted in mathematical reference books for many years,
but D. F. Ferguson noticed in 1945 that it contained several mistakes beginning
at the 528th decimal place [Math. Gazette 30 (1946), 89-90]. G. Reitwiesner 
and his colleagues used 70 hours of computing time on ENIAC during Labor 
Day weekend in 1949 to obtain 2037 correct decimals [Math. Tables and Other 
Aids to Comp. 4 (1950), 11-15]. F. Genuys reached 10,000 digits in 1958, 
after 100 minutes on an IBM 704 [Chiffres 1 (1958), 17-22]; shortly afterwards, 
the first 100,000 digits were published by D. Shanks [no relation to William] 
and J. W. Wrench, Jr. [Math. Comp. 16 (1962), 76-99], after about 8 hours 
on an IBM 7090 and another 4.5 hours for checking. Their check actually 
revealed a transient hardware error, which went away when the computation 
was repeated. One million digits of π were computed by Jean Guilloud and
Martine Bouyer of the French Atomic Energy Commission in 1973, after nearly 
24 hours of computer time on a CDC 7600 [see A. Shibata, Surikagaku 20 
(1982), 65-73]. Amazingly, Dr. I. J. Matrix had correctly predicted seven years 
earlier that the millionth digit would turn out to be “5” [Martin Gardner, New 
Mathematical Diversions (Simon and Schuster, 1966), addendum to Chapter 8]. 
The billion-digit barrier was passed in 1989 by Gregory V. Chudnovsky and 
David V. Chudnovsky, and independently by Yasumasa Kanada and Yoshiaki 
Tamura; the Chudnovskys extended their calculation to two billion digits in 1991, 
after 250 hours of computation on a home-built parallel machine. [See Richard 
Preston, The New Yorker 68,2 (2 March 1992), 36-67. The novel formula used 
by the Chudnovskys is described in Proc. Nat. Acad. Sci. 86 (1989), 8178-8182.]
Yasumasa Kanada and Daisuke Takahashi obtained more than 51.5 billion digits 
in July, 1997, using two independent methods that required respectively 29.0 and 
37.1 hours on a HITACHI SR2201 computer with 1024 processing elements. By 
2011 the world record had risen to ten trillion digits(!), obtained by A. J. Yee 
and S. Kondo using the Chudnovsky formula together with exercise 39. 

We have restricted our discussion in this section to arithmetic techniques for 
use in computer programming. Many algorithms for hardware implementation 
of arithmetic operations are also quite interesting, but they appear to be inap- 
plicable to high-precision software routines; see, for example, G. W. Reitwiesner, 
“Binary Arithmetic,” Advances in Computers 1 (New York: Academic Press, 
1960), 231-308; O. L. MacSorley, Proc. IRE 49 (1961), 67-91; G. Metze, IRE 
Trans. EC-11 (1962), 761-764; H. L. Garner, “Number Systems and Arith- 
metic,” Advances in Computers 6 (New York: Academic Press, 1965), 131- 
194. An infamous but very instructive bug in the division routine of the 1994 
Pentium chip is discussed by A. Edelman in SIAM Review 39 (1997), 54-67. The 
minimum achievable execution time for hardware addition and multiplication 
operations has been investigated by S. Winograd, JACM 12 (1965), 277-285, 
14 (1967), 793-802; by R. P. Brent, IEEE Trans. C-19 (1970), 758-759; and by 
R. W. Floyd, FOCS 16 (1975), 3-5. See also Section 4.3.3E. 


EXERCISES 


1. [42] Study the early history of the classical algorithms for arithmetic by looking up
the writings of, say, Sun Tsŭ, al-Khwarizmi, al-Uqlidisi, Fibonacci, and Robert Recorde,
and by translating their methods as faithfully as possible into precise algorithmic
notation.


2. [15] Generalize Algorithm A so that it does “column addition,” obtaining the 
sum of m nonnegative n-place integers. (Assume that m < b.) 


3. [21] Write a MIX program for the algorithm of exercise 2, and estimate its running 
time as a function of m and n. 


4. [M21] Give a formal proof of the validity of Algorithm A, using the method of 
inductive assertions explained in Section 1.2.1. 


5. [21] Algorithm A adds the two inputs by going from right to left, but sometimes 
the data is more readily accessible from left to right. Design an algorithm that produces 
the same answer as Algorithm A, but that generates the digits of the answer from left 
to right, going back to change previous values if a carry occurs to make a previous 
value incorrect. [Note: Early Hindu and Arabic manuscripts dealt with addition from 
left to right in this way, probably because it was customary to work from left to right 
on an abacus; the right-to-left addition algorithm was a refinement due to al-Uqlidisi, 
perhaps because Arabic is written from right to left.] 


6. [22] Design an algorithm that adds from left to right (as in exercise 5), but never 
stores a digit of the answer until this digit cannot possibly be affected by future carries; 
there is to be no changing of any answer digit once it has been stored. [Hint: Keep track 
of the number of consecutive $(b - 1)$’s that have not yet been stored in the answer.]
This sort of algorithm would be appropriate, for example, in a situation where the 
input and output numbers are to be read and written from left to right on magnetic 
tapes, or if they appear in straight linear lists. 


7. [M26] Determine the average number of times the algorithm of exercise 5 will find 
that a carry makes it necessary to go back and change k digits of the partial answer, for 
$k = 1, 2, \ldots, n$. (Assume that both inputs are independently and uniformly distributed
between 0 and $b^n - 1$.)


8. [M26] Write a MIX program for the algorithm of exercise 5, and determine its 
average running time based on the expected number of carries as computed in the text. 


9. [21] Generalize Algorithm A to obtain an algorithm that adds two n-place numbers
in a mixed-radix number system, with bases $b_0$, $b_1$, ... (from right to left). Thus
the least significant digits lie between 0 and $b_0 - 1$, the next digits lie between 0 and
$b_1 - 1$, etc.; see Eq. 4.1-(9).


10. [18] Would Program S work properly if the instructions on lines 06 and 07 were 
interchanged? If the instructions on lines 05 and 06 were interchanged? 


11. [10] Design an algorithm that compares two nonnegative n-place integers
$u = (u_{n-1} \ldots u_1 u_0)_b$ and $v = (v_{n-1} \ldots v_1 v_0)_b$, to determine whether $u < v$, $u = v$, or $u > v$.


12. [16] Algorithm S assumes that we know which of the two input operands is the 
larger; if this information is not known, we could go ahead and perform the subtraction 
anyway, and we would find that an extra borrow is still present at the end of the 
algorithm. Design another algorithm that could be used (if there is a borrow present 
at the end of Algorithm S) to complement $(w_{n-1} \ldots w_1 w_0)_b$ and therefore to obtain
the absolute value of the difference of u and v. 


13. [21] Write a MIX program that multiplies $(u_{n-1} \ldots u_1 u_0)_b$ by v, where v is a
single-precision number (that is, 0 < v < b), producing the answer $(w_n \ldots w_1 w_0)_b$. How much
running time is required?


> 14. [M22] Give a formal proof of the validity of Algorithm M, using the method of
inductive assertions explained in Section 1.2.1. (See exercise 4.) 


15. [M20] If we wish to form the product of two n-place fractions,
$(.u_1 u_2 \ldots u_n)_b \times (.v_1 v_2 \ldots v_n)_b$, and to obtain only an n-place approximation
$(.w_1 w_2 \ldots w_n)_b$ to the result, Algorithm M could be used to obtain a 2n-place answer
that is subsequently rounded to the desired approximation. But this involves about twice
as much work as is necessary for reasonable accuracy, since the products $u_i v_j$ for
$i + j > n + 2$ contribute very little to the answer.

Give an estimate of the maximum error that can occur, if these products $u_i v_j$ for
$i + j > n + 2$ are not computed during the multiplication, but are assumed to be zero.


> 16. [20] (Short division.) Design an algorithm that divides a nonnegative n-place 
integer $(u_{n-1} \ldots u_1 u_0)_b$ by v, where v is a single-precision number (that is, 0 < v < b),
producing the quotient $(w_{n-1} \ldots w_1 w_0)_b$ and remainder r.


17. [M20] In the notation of Fig. 6, assume that $v_{n-1} \ge \lfloor b/2 \rfloor$; show that if $u_n = v_{n-1}$,
we must have $q = b - 1$ or $b - 2$.


18. [M20] In the notation of Fig. 6, show that if $q' = \lfloor (u_n b + u_{n-1})/(v_{n-1} + 1) \rfloor$, then
$q' \le q$.


> 19. [M21] In the notation of Fig. 6, let $\hat q$ be an approximation to q, and let
$\hat r = u_n b + u_{n-1} - \hat q v_{n-1}$. Assume that $v_{n-1} > 0$. Show that if $\hat q v_{n-2} > b \hat r + u_{n-2}$, then
$q < \hat q$. [Hint: Strengthen the proof of Theorem A by examining the influence of $v_{n-2}$.]


20. [M22] Using the notation and assumptions of exercise 19, show that if
$\hat q v_{n-2} \le b \hat r + u_{n-2}$ and $\hat q < b$, then $\hat q = q$ or $q = \hat q - 1$.


> 21. [M23] Show that if $v_{n-1} \ge \lfloor b/2 \rfloor$, and if $\hat q v_{n-2} \le b \hat r + u_{n-2}$ but $\hat q \ne q$ in the
notation of exercises 19 and 20, then $u \bmod v \ge (1 - 2/b) v$. (The latter event occurs
with approximate probability 2/b, so that when b is the word size of a computer we
must have $q_j = \hat q$ in Algorithm D except in very rare circumstances.)


> 22. [24] Find an example of a four-digit number divided by a three-digit number for 
which step D6 is necessary in Algorithm D, when the radix b is 10. 


23. [M23] Given that v and b are integers, and that $1 \le v < b$, prove that we always
have $\lfloor b/2 \rfloor \le v \lfloor b/(v+1) \rfloor < (v+1) \lfloor b/(v+1) \rfloor \le b$.


24. [M20] Using the law of the distribution of leading digits explained in Section 4.2.4, 
give an approximate formula for the probability that d = 1 in Algorithm D. (When 
d = 1, we can omit most of the calculation in steps D1 and D8.)


25. [26] Write a MIX routine for step D1, which is needed to complete Program D. 
26. [21] Write a MIX routine for step D8, which is needed to complete Program D. 


27. [M20] Prove that at the beginning of step D8 in Algorithm D, the unnormalized 
remainder $(u_{n-1} \ldots u_1 u_0)_b$ is always an exact multiple of d.


28. [M30] (A. Svoboda, Stroje na Zpracování Informací 9 (1963), 25-32.) Let
$v = (v_{n-1} \ldots v_1 v_0)_b$ be any radix b integer, where $v_{n-1} \ne 0$. Perform the following
operations:


N1. If $v_{n-1} < b/2$, multiply v by $\lfloor (b + 1)/(v_{n-1} + 1) \rfloor$. Let the result of this step
be $(v_n v_{n-1} \ldots v_1 v_0)_b$.


N2. If $v_n = 0$, set $v \leftarrow v + (1/b) \lfloor b(b - v_{n-1})/(v_{n-1} + 1) \rfloor v$; let the result of this
step be $(v_n v_{n-1} \ldots v_0 . v_{-1} \ldots)_b$. Repeat step N2 until $v_n \ne 0$. ∎

Prove that step N2 will be performed at most three times, and that we must always
have $v_n = 1$, $v_{n-1} = 0$ at the end of the calculations.

[Note: If u and v are both multiplied by the constants above, we do not change
the value of the quotient u/v, and the divisor has been converted into the form
$(1 0 v_{n-2} \ldots v_0 . v_{-1} v_{-2} v_{-3})_b$. This form of the divisor is very convenient because, in
the notation of Algorithm D, we may simply take $\hat q = u_{j+n}$ as a trial divisor at the
beginning of step D3, or $\hat q = b - 1$ when $(u_{j+n+1}, u_{j+n}) = (1, 0)$.]

29. [15] Prove or disprove: At the beginning of step D7 of Algorithm D, we always 
have $u_{j+n} = 0$.

30. [22] If memory space is limited, it may be desirable to use the same storage 
locations for both input and output during the performance of some of the algorithms
in this section. Is it possible to have $w_0, w_1, \ldots, w_{n-1}$ stored in the same respective
locations as $u_0, \ldots, u_{n-1}$ or $v_0, \ldots, v_{n-1}$ during Algorithm A or S? Is it possible to have
the quotient $q_0, \ldots, q_m$ occupy the same locations as $u_n, \ldots, u_{m+n}$ in Algorithm D?
Is there any permissible overlap of memory locations between input and output in
Algorithm M?

31. [28] Assume that b = 3 and that $u = (u_{m+n-1} \ldots u_1 u_0)_3$, $v = (v_{n-1} \ldots v_1 v_0)_3$ are
integers in balanced ternary notation (see Section 4.1), $v_{n-1} \ne 0$. Design a long-division
algorithm that divides u by v, obtaining a remainder whose absolute value does not
exceed $\frac{1}{2}|v|$. Try to find an algorithm that would be efficient if incorporated into the
arithmetic circuitry of a balanced ternary computer. 


32. [M40] Assume that b = 2i and that u and v are complex numbers expressed in 
the quater-imaginary number system. Design algorithms that divide u by v, perhaps 
obtaining a suitable remainder of some sort, and compare their efficiency. 


33. [M40] Design an algorithm for taking square roots, analogous to Algorithm D 
and to the traditional pencil-and-paper method for extracting square roots. 


34. [40] Develop a set of computer subroutines for doing the four arithmetic opera- 
tions on arbitrary integers, putting no constraint on the size of the integers except for 
the implicit assumption that the total memory capacity of the computer should not be 
exceeded. (Use linked memory allocation, so that no time is wasted in finding room to 
put the results.) 


35. [40] Develop a set of computer subroutines for “decuple-precision floating point” 
arithmetic, using excess 0, base b, nine-place floating point number representation, 
where b is the computer word size, and allowing a full word for the exponent. (Thus 
each floating point number is represented in 10 words of memory, and all scaling is 
done by moving full words instead of by shifting within the words.) 


36. [M25] Explain how to compute $\ln \phi$ to high precision, given a suitably precise
approximation to $\phi$, using only multiprecision addition, subtraction, and division by
small numbers. 


37. [20] (E. Salamin.) Explain how to avoid the normalization and unnormalization 
steps of Algorithm D, when d is a power of 2 on a binary computer, without changing 
the sequence of trial quotient digits computed by that algorithm. (How can $\hat q$ be
computed in step D3 if the normalization of step D1 hasn’t been done?) 


38. [M35] Suppose u and v are integers in the range $0 \le u, v < 2^n$. Devise a way
to compute the geometric mean $\lfloor \sqrt{uv} + \frac{1}{2} \rfloor$ by doing O(n) operations of addition,
subtraction, and comparison of (n+2)-bit numbers. [Hint: Use a “pipeline” to combine
the classical methods of multiplication and square rooting.]


39. [25] (D. Bailey, P. Borwein, and S. Plouffe, 1996.) Explain how to compute the
nth bit of the binary representation of $\pi$ without knowing the previous $n - 1$ bits, by
using the identity

$$\pi = \sum_{k \ge 0} \frac{1}{16^k} \left( \frac{4}{8k+1} - \frac{2}{8k+4} - \frac{1}{8k+5} - \frac{1}{8k+6} \right)$$

and doing O(n log n) arithmetic operations on O(log n)-bit integers. (Assume that the
binary digits of $\pi$ do not have surprisingly long stretches of consecutive 0s or 1s.)


40. [M24] Sometimes we want to divide u by v when we know that the remainder 
will be zero. Show that if u is a 2n-place number and v is an n-place number with 
u mod v = 0, we can save about 75% of the work of Algorithm D if we compute half of 
the quotient from left to right and the other half from right to left. 


> 41. [M26] Many applications of high-precision arithmetic require repeated calcula- 
tions modulo a fixed n-place number w, where w is relatively prime to the base b. We 
can speed up such calculations by using a trick due to Peter L. Montgomery [Math. 
Comp. 44 (1985), 519-521], which streamlines the remaindering process by essentially 
working from right to left instead of left to right. 


a) Given $u = \pm (u_{m+n-1} \ldots u_1 u_0)_b$, $w = (w_{n-1} \ldots w_1 w_0)_b$, and a number $w'$ such
that $w_0 w' \bmod b = 1$, show how to compute $v = \pm (v_{n-1} \ldots v_1 v_0)_b$ such that
$b^m v \bmod w = u \bmod w$.

b) Given n-place signed integers u, v, w with $|u|, |v| < w$, and given $w'$ as in (a), show
how to calculate an n-place integer t such that $|t| < w$ and $b^n t \equiv uv$ (modulo w).

c) How do the algorithms of (a) and (b) facilitate arithmetic mod w? 

42. [HM35] Given m and b, let $P_{nk}$ be the probability that $\lfloor (u_1 + \cdots + u_m)/b^n \rfloor = k$,
when $u_1, \ldots, u_m$ are random n-place integers in radix b. (This is the distribution of
$w_n$ in the column addition algorithm of exercise 2.) Show that
$P_{nk} = \frac{1}{m!} \left\langle {m \atop k} \right\rangle + O(b^{-n})$,
where $\left\langle {m \atop k} \right\rangle$ is an Eulerian number (see Section 5.1.3).


> 43. [22] Shades of gray or components of color values in digitized images are usually 
represented as 8-bit numbers u in the range [0..255], denoting the fraction u/255. 
Given two such fractions u/255 and v/255, graphical algorithms often need to compute 
their approximate product w/255, where w is the nearest integer to uv/255. Prove
that w can be obtained from the efficient formula


$$t = uv + 128, \qquad w = \lfloor (\lfloor t/256 \rfloor + t)/256 \rfloor.$$


*4,3.2. Modular Arithmetic 


Another interesting alternative is available for doing arithmetic on large integer 
numbers, based on some simple principles of number theory. The idea is to have 
several moduli $m_1, m_2, \ldots, m_r$ that contain no common factors, and to work
indirectly with residues $u \bmod m_1$, $u \bmod m_2$, ..., $u \bmod m_r$ instead of directly
with the number u.

For convenience in notation throughout this section, let 


$$u_1 = u \bmod m_1, \quad u_2 = u \bmod m_2, \quad \ldots, \quad u_r = u \bmod m_r. \tag{1}$$


It is easy to compute $(u_1, u_2, \ldots, u_r)$ from an integer number u by means of
division; and it is important to note that no information is lost in this process (if
u isn’t too large), since we can recompute u from $(u_1, u_2, \ldots, u_r)$. For example,
if $0 \le u < v \le 1000$, it is impossible to have $(u \bmod 7, u \bmod 11, u \bmod 13)$
equal to $(v \bmod 7, v \bmod 11, v \bmod 13)$. This is a consequence of the “Chinese
remainder theorem” stated below.

We may therefore regard $(u_1, u_2, \ldots, u_r)$ as a new type of internal computer
representation, a “modular representation,” of the integer u. 

The advantages of a modular representation are that addition, subtraction, 
and multiplication are very simple: 


$$(u_1, \ldots, u_r) + (v_1, \ldots, v_r) = ((u_1 + v_1) \bmod m_1, \ldots, (u_r + v_r) \bmod m_r), \tag{2}$$
$$(u_1, \ldots, u_r) - (v_1, \ldots, v_r) = ((u_1 - v_1) \bmod m_1, \ldots, (u_r - v_r) \bmod m_r), \tag{3}$$
$$(u_1, \ldots, u_r) \times (v_1, \ldots, v_r) = ((u_1 \times v_1) \bmod m_1, \ldots, (u_r \times v_r) \bmod m_r). \tag{4}$$


To derive (4), for example, we need to show that

$$uv \bmod m_j = (u \bmod m_j)(v \bmod m_j) \bmod m_j$$

for each modulus $m_j$. But this is a basic fact of elementary number theory:
$x \bmod m_j = y \bmod m_j$ if and only if $x \equiv y$ (modulo $m_j$); furthermore if $x \equiv x'$
and $y \equiv y'$, then $xy \equiv x'y'$ (modulo $m_j$); hence $(u \bmod m_j)(v \bmod m_j) \equiv uv$
(modulo $m_j$).
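
As a small illustration of (2), (3), and (4), here is a minimal Python sketch using the moduli 7, 11, 13 of the example above. The function names (to_modular, mod_add, and so on) are hypothetical choices made for this sketch, and the code assumes that every result stays within the range of 1001 representable values.

```python
from math import prod

MODULI = (7, 11, 13)  # pairwise relatively prime; 7 * 11 * 13 = 1001


def to_modular(u, moduli=MODULI):
    """Return the residues (u mod m_1, ..., u mod m_r), as in Eq. (1)."""
    return tuple(u % m for m in moduli)


def mod_add(us, vs, moduli=MODULI):
    """Componentwise addition, Eq. (2)."""
    return tuple((a + b) % m for a, b, m in zip(us, vs, moduli))


def mod_sub(us, vs, moduli=MODULI):
    """Componentwise subtraction, Eq. (3)."""
    return tuple((a - b) % m for a, b, m in zip(us, vs, moduli))


def mod_mul(us, vs, moduli=MODULI):
    """Componentwise multiplication, Eq. (4)."""
    return tuple((a * b) % m for a, b, m in zip(us, vs, moduli))


range_size = prod(MODULI)           # number of distinct representable values
u, v = 23, 41                       # 23 * 41 = 943 < 1001, so no overflow
product = mod_mul(to_modular(u), to_modular(v))
```

Each component is handled independently of the others, which is exactly what makes the representation attractive; the price is that the tuple by itself does not reveal the magnitude of the number.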

The main disadvantage of a modular representation is that we cannot easily 
test whether $(u_1, \ldots, u_r)$ is greater than $(v_1, \ldots, v_r)$. It is also difficult to test
whether or not overflow has occurred as the result of an addition, subtraction, 
or multiplication, and it is even more difficult to perform division. When such 
operations are required frequently in conjunction with addition, subtraction, and 
multiplication, the use of modular arithmetic can be justified only if fast means 
of conversion to and from the modular representation are available. Therefore 
conversion between modular and positional notation is one of the principal topics 
of interest to us in this section. 

The processes of addition, subtraction, and multiplication using (2), (3), 
and (4) are called residue arithmetic or modular arithmetic. The range of numbers
that can be handled by modular arithmetic is equal to $m = m_1 m_2 \ldots m_r$,
the product of the moduli; and if each $m_j$ is near our computer’s word size we
can deal with n-place numbers when $r \approx n$. Therefore we see that the amount
of time required to add, subtract, or multiply n-place numbers using modular 
arithmetic is essentially proportional to n (not counting the time to convert in 
and out of modular representation). This is no advantage at all when addition 
and subtraction are considered, but it can be a considerable advantage with 
respect to multiplication since the conventional method of Section 4.3.1 requires 
an execution time proportional to $n^2$.

Moreover, on a computer that allows many operations to take place simul- 
taneously, modular arithmetic can be a significant advantage even for addition 
and subtraction; the operations with respect to different moduli can all be done 
at the same time, so we obtain a substantial increase in speed. The same kind of 
decrease in execution time could not be achieved by the conventional techniques
discussed in the previous section, since carry propagation must be considered.
Perhaps highly parallel computers will someday make simultaneous operations 
commonplace, so that modular arithmetic will be of significant importance in 
“real-time” calculations when a quick answer to a single problem requiring high 
precision is needed. (With highly parallel computers, it is often preferable to 
run k separate programs simultaneously, instead of running a single program k 
times as fast, since the latter alternative is more complicated but does not utilize 
the machine any more efficiently. “Real-time” calculations are exceptions that 
make the inherent parallelism of modular arithmetic more significant.) 

Now let us examine the basic fact that underlies the modular representation 
of numbers: 


Theorem C (Chinese Remainder Theorem). Let $m_1, m_2, \ldots, m_r$ be positive
integers that are relatively prime in pairs; that is,


$$m_j \perp m_k \quad \text{when } j \ne k. \tag{5}$$


Let $m = m_1 m_2 \ldots m_r$, and let $a, u_1, u_2, \ldots, u_r$ be integers. Then there is
exactly one integer u that satisfies the conditions


$$a \le u < a + m, \quad \text{and} \quad u \equiv u_j \ (\text{modulo } m_j) \quad \text{for } 1 \le j \le r. \tag{6}$$


Proof. If $u \equiv v$ (modulo $m_j$) for $1 \le j \le r$, then $u - v$ is a multiple of $m_j$ for
all j, so (5) implies that $u - v$ is a multiple of $m = m_1 m_2 \ldots m_r$. This argument
shows that there is at most one solution of (6). To complete the proof we must
now show the existence of at least one solution, and this can be done in two
simple ways:


Method 1 (“Nonconstructive” proof). As u runs through the m distinct values
$a \le u < a + m$, the r-tuples $(u \bmod m_1, \ldots, u \bmod m_r)$ must also run through
m distinct values, since (6) has at most one solution. But there are exactly
$m_1 m_2 \ldots m_r$ possible r-tuples $(v_1, \ldots, v_r)$ such that $0 \le v_j < m_j$. Therefore
each r-tuple must occur exactly once, and there must be some value of u for
which $(u \bmod m_1, \ldots, u \bmod m_r) = (u_1, \ldots, u_r)$.


Method 2 (“Constructive” proof). We can find numbers $M_j$ for $1 \le j \le r$ such
that


$$M_j \equiv 1 \ (\text{modulo } m_j) \quad \text{and} \quad M_j \equiv 0 \ (\text{modulo } m_k) \quad \text{for } k \ne j. \tag{7}$$


This follows because (5) implies that $m_j$ and $m/m_j$ are relatively prime, so we
may take

$$M_j = (m/m_j)^{\varphi(m_j)} \tag{8}$$

by Euler’s theorem (exercise 1.2.4-28). Now the number


$$u = a + ((u_1 M_1 + u_2 M_2 + \cdots + u_r M_r - a) \bmod m) \tag{9}$$


satisfies all the conditions of (6). ∎
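
The constructive proof translates directly into a short program. The following Python sketch is only an illustration under stated assumptions: the names euler_phi, crt_constants, and crt_solve are hypothetical; Euler's function is evaluated by naive counting, which is sensible only for small moduli; and each $M_j$ is reduced modulo m, which preserves the congruences (7) because every $m_k$ divides m.

```python
from math import gcd, prod


def euler_phi(n):
    """Euler's totient by direct counting; adequate only for small n."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)


def crt_constants(moduli):
    """Return M_j = (m/m_j)^phi(m_j), reduced mod m, following Eq. (8)."""
    m = prod(moduli)
    return [pow(m // mj, euler_phi(mj), m) for mj in moduli]


def crt_solve(residues, moduli, a=0):
    """Return the unique u with a <= u < a + m and u = u_j (mod m_j), Eq. (9)."""
    m = prod(moduli)
    total = sum(uj * Mj for uj, Mj in zip(residues, crt_constants(moduli)))
    return a + (total - a) % m
```

For example, crt_solve((5, 8, 7), (7, 11, 13)) recovers 943, and the same residues with a different a give the representative of 943 modulo 1001 that lies in the interval starting at a.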
A very special case of this theorem was stated by the Chinese mathematician
Sun Tsŭ, who gave a rule called tài-yen (“great generalization”). The date of
his writing is very uncertain; it is thought to be between A.D. 280 and 473.
Mathematicians in mediæval India developed the techniques further, with their
methods of kuttaka (see Section 4.5.2), but Theorem C was first stated and proved 
in its proper generality by Ch’in Chiu-Shao in his Shu Shu Chiu Chang (1247); 
the latter work considers also the case where the moduli might have common 
factors as in exercise 3. [See J. Needham, Science and Civilisation in China 3 
(Cambridge University Press, 1959), 33-34, 119-120; Y. Li and S. Du, Chinese 
Mathematics (Oxford: Clarendon, 1987), 92-94, 105, 161-166; K. Shen, Archive 
for History of Exact Sciences 38 (1988), 285-305.] Numerous early contributions 
to this theory have been summarized by L. E. Dickson in his History of the 
Theory of Numbers 2 (Carnegie Inst. of Washington, 1920), 57-64. 

As a consequence of Theorem C, we may use modular representation for 
numbers in any consecutive interval of $m = m_1 m_2 \ldots m_r$ integers. For example,
we could take a = 0 in (6), and work only with nonnegative integers u less
than m. On the other hand, when addition and subtraction are being done, as
well as multiplication, it is usually most convenient to assume that all of the
moduli $m_1, m_2, \ldots, m_r$ are odd numbers, so that $m = m_1 m_2 \ldots m_r$ is odd, and
to work with integers in the range

$$-\frac{m}{2} < u < \frac{m}{2}, \tag{10}$$

which is completely symmetrical about zero.
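
A minimal Python sketch of the symmetric convention (10) follows; the helper name to_symmetric is hypothetical, and the function simply selects, for odd m, the representative of a residue class that lies strictly between $-m/2$ and $m/2$.

```python
def to_symmetric(u, m):
    """Return the value congruent to u (mod m) with -m/2 < value < m/2; m must be odd."""
    if m % 2 == 0:
        raise ValueError("m must be odd for a range symmetric about zero")
    r = u % m                      # 0 <= r < m
    return r - m if r > m // 2 else r
```

With m = 1001, for instance, the residues 0 through 500 stand for themselves and 501 through 1000 stand for the negative numbers from -500 to -1.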

In order to perform the basic operations listed in (2), (3), and (4), we 
need to compute $(u_j + v_j) \bmod m_j$, $(u_j - v_j) \bmod m_j$, and $u_j v_j \bmod m_j$, when
$0 \le u_j, v_j < m_j$. If $m_j$ is a single-precision number, it is most convenient to
form $u_j v_j \bmod m_j$ by doing a multiplication and then a division operation. For
addition and subtraction, the situation is a little simpler, since no division is
necessary; the following formulas may conveniently be used:


$$(u_j + v_j) \bmod m_j = u_j + v_j - m_j [u_j + v_j \ge m_j]. \tag{11}$$
$$(u_j - v_j) \bmod m_j = u_j - v_j + m_j [u_j < v_j]. \tag{12}$$


(See Section 3.2.1.1.) Since we want m to be as large as possible, it is easiest
to let $m_1$ be the largest odd number that fits in a computer word, to let $m_2$ be
the largest odd number $< m_1$ that is relatively prime to $m_1$, to let $m_3$ be the
largest odd number $< m_2$ that is relatively prime to both $m_1$ and $m_2$, and so on
until enough $m_j$’s have been found to give the desired range m. Efficient ways
to determine whether or not two integers are relatively prime are discussed in 
Section 4.5.2. 

As a simple example, suppose that we have a decimal computer whose words 
hold only two digits, so that the word size is 100. Then the procedure described 
in the previous paragraph would give 


$$m_1 = 99, \quad m_2 = 97, \quad m_3 = 95, \quad m_4 = 91, \quad m_5 = 89, \quad m_6 = 83, \tag{13}$$


and so on. 
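
The greedy selection of moduli and the division-free formulas (11) and (12) can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that "fits in a computer word" means "is less than the word size".

```python
from math import gcd


def choose_moduli(word_size, count):
    """Greedily pick up to `count` odd moduli below word_size, pairwise relatively prime."""
    moduli = []
    # Largest odd number that is less than word_size.
    candidate = word_size - 1 if word_size % 2 == 0 else word_size - 2
    while len(moduli) < count and candidate > 1:
        if all(gcd(candidate, m) == 1 for m in moduli):
            moduli.append(candidate)
        candidate -= 2
    return moduli


def add_mod(uj, vj, mj):
    """Eq. (11); assumes 0 <= uj, vj < mj, so one conditional subtraction suffices."""
    s = uj + vj
    return s - mj if s >= mj else s


def sub_mod(uj, vj, mj):
    """Eq. (12); assumes 0 <= uj, vj < mj, so one conditional addition suffices."""
    d = uj - vj
    return d + mj if d < 0 else d
```

Calling choose_moduli(100, 6) yields the six moduli of (13); the candidates 93, 87, and 85 are rejected because they share a factor with 99 or 95.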
On binary computers it is sometimes desirable to choose the $m_j$ in a different
way, by selecting

$$m_j = 2^{e_j} - 1. \tag{14}$$

In other words, each modulus is one less than a power of 2. Such a choice of
$m_j$ often makes the basic arithmetic operations simpler, because it is relatively
easy to work modulo $2^{e_j} - 1$, as in ones’ complement arithmetic. When the
moduli are chosen according to this strategy, it is helpful to relax the condition
$0 \le u_j < m_j$ slightly, so that we require only


$$0 \le u_j < 2^{e_j}, \qquad u_j \equiv u \ (\text{modulo } 2^{e_j} - 1). \tag{15}$$


Thus, the value $u_j = m_j = 2^{e_j} - 1$ is allowed as an optional alternative to $u_j = 0$;
this does not affect the validity of Theorem C, and it means we are allowing $u_j$ to
be any $e_j$-bit binary number. Under this assumption, the operations of addition
and multiplication modulo $m_j$ become the following:


$$u_j \oplus v_j = ((u_j + v_j) \bmod 2^{e_j}) + [u_j + v_j \ge 2^{e_j}]. \tag{16}$$
$$u_j \otimes v_j = (u_j v_j \bmod 2^{e_j}) \oplus \lfloor u_j v_j / 2^{e_j} \rfloor. \tag{17}$$


(Here $\oplus$ and $\otimes$ refer to the operations done on the individual components of
$(u_1, \ldots, u_r)$ and $(v_1, \ldots, v_r)$ when adding or multiplying, respectively, using the
convention (15).) Equation (12) is still good for subtraction, or we can use


$$u_j \ominus v_j = ((u_j - v_j) \bmod 2^{e_j}) - [u_j < v_j]. \tag{18}$$


These operations can be performed efficiently even when $2^{e_j}$ is larger than the
computer’s word size, since it is a simple matter to compute the remainder of a 
positive number modulo a power of 2, or to divide a number by a power of 2. 
In (17) we have the sum of the “upper half” and the “lower half” of the product, 
as discussed in exercise 3.2.1.1-8. 
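
Operations (16), (17), and (18) need only masking and shifting, as the following Python sketch suggests. The names oc_add, oc_mul, and oc_sub are hypothetical; the operands are assumed to satisfy the relaxed condition (15), that is, to be arbitrary e-bit numbers.

```python
def oc_add(uj, vj, e):
    """Eq. (16): addition modulo 2**e - 1 with an end-around carry."""
    s = uj + vj
    return (s & ((1 << e) - 1)) + (s >> e)   # s >> e is 0 or 1 here


def oc_mul(uj, vj, e):
    """Eq. (17): lower half of the product, plus (in the sense of Eq. (16)) the upper half."""
    p = uj * vj
    return oc_add(p & ((1 << e) - 1), p >> e, e)


def oc_sub(uj, vj, e):
    """Eq. (18): subtraction modulo 2**e - 1 with an end-around borrow."""
    mask = (1 << e) - 1
    return ((uj - vj) & mask) - (1 if uj < vj else 0)
```

The results are again e-bit numbers, and they may take the value $2^e - 1$ where an ordinary residue would be 0, as permitted by (15).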

If moduli of the form $2^{e_j} - 1$ are to be used, we must know under what
conditions the number $2^e - 1$ is relatively prime to the number $2^f - 1$. Fortunately,
there is a very simple rule:


$$\gcd(2^e - 1, 2^f - 1) = 2^{\gcd(e,f)} - 1. \tag{19}$$


This formula states in particular that $2^e - 1$ and $2^f - 1$ are relatively prime if
and only if e and f are relatively prime. Equation (19) follows from Euclid’s
algorithm and the identity


$$(2^e - 1) \bmod (2^f - 1) = 2^{e \bmod f} - 1. \tag{20}$$


(See exercise 6.) On a computer with word size $2^{35}$, we could therefore choose
$m_1 = 2^{35} - 1$, $m_2 = 2^{34} - 1$, $m_3 = 2^{33} - 1$, $m_4 = 2^{31} - 1$, $m_5 = 2^{29} - 1$; this
would permit efficient addition, subtraction, and multiplication of integers in a
range of size $m_1 m_2 m_3 m_4 m_5 > 2^{161}$.
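
Rule (19) and the five-modulus example are easy to check numerically. The Python sketch below is a brute-force verification over small exponents, not a proof; it takes the exponents of the example to be 35, 34, 33, 31, and 29, and the variable names are our own.

```python
from itertools import combinations
from math import gcd, prod

# Check Eq. (19) for all exponent pairs below 40.
rule_holds = all(
    gcd(2**e - 1, 2**f - 1) == 2**gcd(e, f) - 1
    for e in range(1, 40)
    for f in range(1, 40)
)

# The five moduli of the example: their exponents are pairwise relatively prime.
exponents = (35, 34, 33, 31, 29)
moduli = [2**e - 1 for e in exponents]
pairwise_coprime = all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))
range_size = prod(moduli)
```

Since the exponents sum to 162 and each modulus falls just short of its power of 2, the product lies between $2^{161}$ and $2^{162}$.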

As we have already observed, the operations of conversion to and from 
modular representation are very important. If we are given a number u, its 
modular representation $(u_1, \ldots, u_r)$ may be obtained by simply dividing u by
$m_1, \ldots, m_r$ and saving the remainders. A possibly more attractive procedure,
if $u = (v_m v_{m-1} \ldots v_0)_b$, is to evaluate the polynomial

$$(\ldots (v_m b + v_{m-1}) b + \cdots) b + v_0$$

using modular arithmetic. When b = 2 and when the modulus $m_j$ has the special
form $2^{e_j} - 1$, both of these methods reduce to quite a simple procedure: Consider
the binary representation of u with blocks of $e_j$ bits grouped together,


$$u = a_t A^t + a_{t-1} A^{t-1} + \cdots + a_1 A + a_0, \tag{21}$$


where $A = 2^{e_j}$ and $0 \le a_k < 2^{e_j}$ for $0 \le k \le t$. Then


$$u \equiv a_t + a_{t-1} + \cdots + a_1 + a_0 \ (\text{modulo } 2^{e_j} - 1), \tag{22}$$


since $A \equiv 1$, so we obtain $u_j$ by adding the $e_j$-bit numbers $a_t \oplus \cdots \oplus a_1 \oplus a_0$,
using (16). This process is similar to the familiar device of “casting out nines” 
that determines u mod 9 when u is expressed in the decimal system. 
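
The block-summing procedure of (21) and (22) might be sketched in Python as follows. The function name is hypothetical, the input is assumed to be a nonnegative integer, and the result follows convention (15), so it can be $2^e - 1$ where the ordinary residue would be 0.

```python
def residue_mod_2e_minus_1(u, e):
    """Return a value in [0, 2**e) congruent to u modulo 2**e - 1, for u >= 0."""
    mask = (1 << e) - 1
    acc = 0
    while u:
        block = u & mask                  # the next e-bit block a_k of Eq. (21)
        s = acc + block
        acc = (s & mask) + (s >> e)       # end-around-carry addition, Eq. (16)
        u >>= e
    return acc
```

No division is performed; the loop touches each e-bit block once, just as casting out nines touches each decimal digit once.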

Conversion back from modular form to positional notation is somewhat more 
difficult. It is interesting in this regard to notice how the study of computation 
changes our viewpoint towards mathematical proofs: Theorem C tells us that the 
conversion from $(u_1, \ldots, u_r)$ to u is possible, and two proofs are given. The first
proof we considered is a classical one that relies only on very simple concepts, 
namely the facts that 


i) any number that is a multiple of $m_1$, of $m_2$, ..., and of $m_r$, must be a
multiple of $m_1 m_2 \ldots m_r$ when the $m_j$’s are pairwise relatively prime; and

ii) if m pigeons are put into m pigeonholes with no two pigeons in the same 
hole, then there must be one in each hole. 


By traditional notions of mathematical aesthetics, this is no doubt the nicest 
proof of Theorem C; but from a computational standpoint it is completely 
worthless. It amounts to saying, “Try u = a, a + 1, ... until you find a value for
which $u \equiv u_1$ (modulo $m_1$), ..., $u \equiv u_r$ (modulo $m_r$).”

The second proof of Theorem C is more explicit; it shows how to compute r 
new constants $M_1, \ldots, M_r$, and to get the solution in terms of these constants
by formula (9). This proof uses more complicated concepts (for example, Euler’s
theorem), but it is much more satisfactory from a computational standpoint,
since the constants $M_1, \ldots, M_r$ need to be determined only once. On the
other hand, the determination of $M_j$ by Eq. (8) is certainly not trivial, since the
evaluation of Euler’s $\varphi$-function requires, in general, the factorization of $m_j$ into
prime powers. There are much better ways to compute $M_j$ than to use (8); in
this respect we can see again the distinction between mathematical elegance and
computational efficiency. But even if we find $M_j$ by the best possible method,
we’re stuck with the fact that $M_j$ is a multiple of the huge number $m/m_j$. Thus,
(9) forces us to do a lot of high-precision calculation, and such calculation is just 
what we wished to avoid by modular arithmetic in the first place. 

So we need an even better proof of Theorem C if we are going to have a 
really usable method of conversion from $(u_1, \ldots, u_r)$ to u. Such a method was
suggested by H. L. Garner in 1958; it can be carried out using $\binom{r}{2}$ constants $c_{ij}$
for $1 \le i < j \le r$, where

$$c_{ij} m_i \equiv 1 \ (\text{modulo } m_j). \tag{23}$$

These constants $c_{ij}$ are readily computed using Euclid’s algorithm, since for any
given i and j Algorithm 4.5.2X will determine a and b such that
$a m_i + b m_j = \gcd(m_i, m_j) = 1$, and we may take $c_{ij} = a$. When the moduli have the special
form $2^{e_j} - 1$, a simple method of determining $c_{ij}$ is given in exercise 6.

Once the $c_{ij}$ have been determined satisfying (23), we can set

$$\begin{aligned}
v_1 &\leftarrow u_1 \bmod m_1, \\
v_2 &\leftarrow (u_2 - v_1)\, c_{12} \bmod m_2, \\
v_3 &\leftarrow ((u_3 - v_1)\, c_{13} - v_2)\, c_{23} \bmod m_3, \\
&\ \ \vdots \\
v_r &\leftarrow (\ldots((u_r - v_1)\, c_{1r} - v_2)\, c_{2r} - \cdots - v_{r-1})\, c_{(r-1)r} \bmod m_r.
\end{aligned} \tag{24}$$

Then

$$u = v_r m_{r-1} \ldots m_2 m_1 + \cdots + v_3 m_2 m_1 + v_2 m_1 + v_1 \tag{25}$$

is a number satisfying the conditions

$$0 \le u < m, \qquad u \equiv u_j \ (\text{modulo } m_j) \quad \text{for } 1 \le j \le r. \tag{26}$$


(See exercise 8; another way of rewriting (24) that does not involve as many 
auxiliary constants is given in exercise 7.) Equation (25) is a mixed-radix repre- 
sentation of u, which can be converted to binary or decimal notation using the 
methods of Section 4.4. If $0 \le u < m$ is not the desired range, an appropriate
multiple of m can be added or subtracted after the conversion process. 
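
Here is a minimal Python sketch of the conversion by (23), (24), and (25). The function names are hypothetical, indices run from 0 rather than 1, and the inverses are obtained with Python's built-in pow(x, -1, n) (available from Python 3.8 on) instead of an explicit call to Algorithm 4.5.2X.

```python
def garner_constants(moduli):
    """Return c[i, j] with c[i, j] * m_i = 1 (mod m_j) for i < j, as in Eq. (23)."""
    r = len(moduli)
    return {(i, j): pow(moduli[i], -1, moduli[j])
            for i in range(r) for j in range(i + 1, r)}


def mixed_radix_digits(residues, moduli, c=None):
    """Compute the digits v_1, ..., v_r of Eq. (24), using arithmetic mod m_j only."""
    if c is None:
        c = garner_constants(moduli)
    v = []
    for j, (uj, mj) in enumerate(zip(residues, moduli)):
        t = uj % mj
        for i in range(j):
            t = (t - v[i]) * c[i, j] % mj
        v.append(t)
    return v


def from_mixed_radix(v, moduli):
    """Evaluate Eq. (25) by Horner's rule; this is the only multiprecision step."""
    u = 0
    for vj, mj in zip(reversed(v), reversed(moduli)):
        u = u * mj + vj
    return u
```

With moduli (7, 11, 13) and residues (5, 8, 7) the digits come out as 5, 2, 12, and indeed 12·11·7 + 2·7 + 5 = 943.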

The advantage of the computation shown in (24) is that the calculation 
of $v_j$ can be done using only arithmetic mod $m_j$, which is already built into the
modular arithmetic algorithms. Furthermore, (24) allows parallel computation: 
We can start with $(v_1, \ldots, v_r) \leftarrow (u_1 \bmod m_1, \ldots, u_r \bmod m_r)$, then at time j
for $1 \le j < r$ we simultaneously set $v_k \leftarrow (v_k - v_j)\, c_{jk} \bmod m_k$ for $j < k \le r$.
An alternative way to compute the mixed-radix representation, allowing similar 
possibilities for parallelism, has been discussed by A. S. Fraenkel, Proc. ACM 
Nat. Conf. 19 (Philadelphia: 1964), 1.4. 
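
The time-step formulation can be sketched in Python as below. The sketch is sequential, so it only indicates where the parallelism lies: for a fixed j, the updates of the inner loop are independent of one another. The function name is hypothetical, indices run from 0, and pow(x, -1, n) stands in for precomputed constants $c_{jk}$.

```python
def mixed_radix_digits_stepwise(residues, moduli):
    """Compute the mixed-radix digits by r - 1 rounds of independent updates."""
    v = [u % m for u, m in zip(residues, moduli)]
    r = len(moduli)
    for j in range(r - 1):
        # At "time j" every component k > j can be updated simultaneously.
        for k in range(j + 1, r):
            c_jk = pow(moduli[j], -1, moduli[k])
            v[k] = (v[k] - v[j]) * c_jk % moduli[k]
    return v
```

After round j the value v[j] is final, so the whole conversion takes r - 1 rounds when enough processing units are available.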

It is important to observe that the mixed-radix representation (25) is suffi- 
cient to compare the magnitudes of two modular numbers. For if we know that 
$0 \le u < m$ and $0 \le u' < m$, then we can tell if $u < u'$ by first doing the
conversion to $(v_1, \ldots, v_r)$ and $(v'_1, \ldots, v'_r)$, then testing if $v_r < v'_r$, or if $v_r = v'_r$
and $v_{r-1} < v'_{r-1}$, etc., according to lexicographic order. It is not necessary
to convert all the way to binary or decimal notation if we only want to know
whether $(u_1, \ldots, u_r)$ is less than $(u'_1, \ldots, u'_r)$.
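
A comparison along these lines might look as follows in Python. The names mixed_radix and modular_less_than are hypothetical; both operands are assumed to represent numbers in the range from 0 to m - 1, and Python's built-in list comparison supplies the lexicographic test once the digits are listed from most significant to least significant.

```python
def mixed_radix(residues, moduli):
    """Mixed-radix digits [v_1, ..., v_r] of the number with the given residues."""
    v = [u % m for u, m in zip(residues, moduli)]
    for j in range(len(moduli)):
        for k in range(j + 1, len(moduli)):
            v[k] = (v[k] - v[j]) * pow(moduli[j], -1, moduli[k]) % moduli[k]
    return v


def modular_less_than(us, ws, moduli):
    """Decide u < u' from the residues alone, assuming 0 <= u, u' < m."""
    # Reverse so that v_r, the most significant digit, is compared first.
    return mixed_radix(us, moduli)[::-1] < mixed_radix(ws, moduli)[::-1]
```

No multiprecision number is ever formed here; every intermediate value is smaller than one of the moduli.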

The operation of comparing two numbers, or of deciding if a modular number 
is negative, is intuitively very simple, so we would expect to have a much easier 
way to make this test than the conversion to mixed-radix form. But the following
theorem shows that there is little hope of finding a substantially better method,
since the range of a modular number depends essentially on all bits of all the
residues $(u_1, \ldots, u_r)$:


Theorem S (Nicholas Szabó, 1961). In terms of the notation above, assume 
that $m_1 < \sqrt{m}$, and let L be any value in the range


$$m_1 \le L \le m - m_1. \tag{27}$$


Let g be any function such that the set $\{g(0), g(1), \ldots, g(m_1 - 1)\}$ contains fewer
than $m_1$ values. Then there are numbers u and v such that


$$g(u \bmod m_1) = g(v \bmod m_1), \qquad u \bmod m_j = v \bmod m_j \quad \text{for } 2 \le j \le r; \tag{28}$$
$$0 \le u < L \le v < m. \tag{29}$$


Proof. By hypothesis, there must exist numbers $u \ne v$ satisfying (28), since
g must take on the same value for two different residues. Let (u, v) be a pair
of values with $0 \le u < v < m$ satisfying (28), for which u is a minimum.
Since $u' = u - m_1$ and $v' = v - m_1$ also satisfy (28), we must have $u' < 0$ by
the minimality of u. Hence $u < m_1 \le L$; and if (29) does not hold, we must
have $v < L$. But $v > u$, and $v - u$ is a multiple of $m_2 \ldots m_r = m/m_1$, so
$v \ge v - u \ge m/m_1 > m_1$. Therefore, if (29) does not hold for (u, v), it will be
satisfied for the pair $(u'', v'') = (v - m_1, u + m - m_1)$. ∎


Of course, a similar result can be proved for any $m_j$ in place of $m_1$; and we
could also replace (29) by the condition “$a \le u < a + L \le v < a + m$” with
only minor changes in the proof. Therefore Theorem S shows that many simple 
functions cannot be used to determine the range of a modular number. 
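
Theorem S can be observed on a small example by exhaustive search. The Python sketch below is an illustration only; the name szabo_pair is hypothetical, and the moduli (3, 5, 7) with g(x) = x mod 2 are assumptions chosen so that $m_1 = 3 < \sqrt{105}$ and g takes only two values on {0, 1, 2}.

```python
from math import prod


def szabo_pair(moduli, g, L):
    """Search by brute force for (u, v) satisfying conditions (28) and (29)."""
    m1, m = moduli[0], prod(moduli)
    for u in range(L):
        for v in range(L, m):
            same_g = g(u % m1) == g(v % m1)
            same_rest = all(u % mj == v % mj for mj in moduli[1:])
            if same_g and same_rest:
                return u, v
    return None
```

For every L between 3 and 102 the search finds a pair; with L = 3, for instance, it returns (0, 35). Thus no test that looks at $u \bmod 3$ only through g can tell reliably whether a number lies below L.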


Let us now reiterate the main points of the discussion in this section: Mod- 
ular arithmetic can be a significant advantage for applications in which the pre- 
dominant calculations involve exact multiplication (or raising to a power) of 
large integers, combined with addition and subtraction, but where there is very 
little need to divide or compare numbers, or to test whether intermediate results 
“overflow” out of range. (It is important not to forget the latter restriction; 
methods are available to test for overflow, as in exercise 12, but they are so 
complicated that they nullify the advantages of modular arithmetic.) Several 
applications of modular computations have been discussed by H. Takahasi and 
Y. Ishibashi, Information Proc. in Japan 1 (1961), 28-42. 

An example of such an application is the exact solution of linear equations 
with rational coefficients. For various reasons it is desirable in this case to assume 
that the moduli $m_1, m_2, \ldots, m_r$ are all prime numbers; the linear equations can
be solved independently modulo each $m_j$. A detailed discussion of this procedure
has been given by I. Borosh and A. S. Fraenkel [Math. Comp. 20 (1966), 107- 
112], with further improvements by A. S. Fraenkel and D. Loewenthal [J. Res. 
National Bureau of Standards 75B (1971), 67-75]. By means of their method, 
the nine independent solutions of a system of 111 linear equations in 120 un- 
knowns were obtained exactly in less than 20 minutes on a CDC 1604 computer. 
The same procedure is worthwhile also for solving simultaneous linear equations
with floating point coefficients, when the matrix of coefficients is ill-conditioned.
The modular technique (treating the given floating point coefficients as exact 
rational numbers) gives a method for obtaining the true answers in less time 
than conventional methods can produce reliable approximate answers! [See M. T. 
McClellan, JACM 20 (1973), 563-588, for further developments of this approach; 
and see also E. H. Bareiss, J. Inst. Math. and Appl. 10 (1972), 68-104, for a 
discussion of its limitations. ] 

The published literature concerning modular arithmetic is mostly oriented 
towards hardware design, since the carry-free properties of modular arithmetic 
make it attractive from the standpoint of high-speed operation. The idea was 
first published by A. Svoboda and M. Valach in the Czechoslovakian journal 
Stroje na Zpracování Informací (Information Processing Machines) 3 (1955), 
247-295; then independently by H. L. Garner [IRE Trans. EC-8 (1959), 140- 
147]. The use of moduli of the form $2^e - 1$ was suggested by A. S. Fraenkel
[JACM 8 (1961), 87-96], and several advantages of such moduli were demonstrated
by A. Schönhage [Computing 1 (1966), 182-196]. See the book Residue
Arithmetic and Its Applications to Computer Technology by N. S. Szabó and 
R. I. Tanaka (New York: McGraw-Hill, 1967), for additional information and a 
comprehensive bibliography of the subject. A Russian book published in 1968 
by I. Y. Akushsky and D. I. Yuditsky includes a chapter about complex moduli 
[see Rev. Roumaine de Math. Pures et Appl. 15 (1970), 159-160]. 

Further discussion of modular arithmetic can be found in Section 4.3.3B. 


The notice-board had said he was in Room 423, 

but the numbering system, nominally consecutive, 
seemed to have been applied on a plan that could only 
have been the work of a lunatic or a mathematician. 


— ROBERT BARNARD, The Case of the Missing Bronté (1983) 


EXERCISES 


1. [20] Find all integers u that satisfy all of the following conditions: u mod 7 = 1,
u mod 11 = 6, u mod 13 = 5, 0 ≤ u < 1000.


2. [M20] Would Theorem C still be true if we allowed a, $u_1$, $u_2$, ..., $u_r$ and u to be
arbitrary real numbers (not just integers)?


> 3. [M26] (Generalized Chinese Remainder Theorem.) Let $m_1$, $m_2$, ..., $m_r$ be positive
integers. Let m be the least common multiple of $m_1$, $m_2$, ..., $m_r$, and let a, $u_1$,
$u_2$, ..., $u_r$ be any integers. Prove that there is exactly one integer u that satisfies the
conditions

$$a \le u < a + m, \qquad u \equiv u_j \ (\text{modulo } m_j), \qquad 1 \le j \le r,$$

provided that

$$u_i \equiv u_j \ (\text{modulo } \gcd(m_i, m_j)), \qquad 1 \le i < j \le r;$$

and there is no such integer u when the latter condition fails to hold.


4. [20] Continue the process shown in (13); what would $m_7$, $m_8$, $m_9$, ... be?


> 5. [M23] (a) Suppose that the method of (13) is continued until no more $m_j$ can be
chosen. Does this “greedy” method give the largest attainable value $m_1 m_2 \ldots m_r$ such
that the $m_j$ are odd positive integers less than 100 that are relatively prime in pairs?
(b) What is the largest possible $m_1 m_2 \ldots m_r$ when each residue $u_j$ must fit in eight
bits of memory?


6. [M22] Let e, f, and g be nonnegative integers.
a) Show that $2^e \equiv 2^f$ (modulo $2^g - 1$) if and only if $e \equiv f$ (modulo g).
b) Given that e mod f = d and ce mod f = 1, prove the identity

$$\bigl((1 + 2^d + \cdots + 2^{(c-1)d}) \cdot (2^e - 1)\bigr) \bmod (2^f - 1) = 1.$$

(Thus, we have a comparatively simple formula for the inverse of $2^e - 1$, modulo
$2^f - 1$, as required in (23).)


> 7. [M21] Show that (24) can be rewritten as follows:

$$v_1 \leftarrow u_1 \bmod m_1,$$
$$v_2 \leftarrow (u_2 - v_1)\, c_{12} \bmod m_2,$$
$$v_3 \leftarrow (u_3 - (v_1 + m_1 v_2))\, c_{13} c_{23} \bmod m_3,$$
$$\vdots$$
$$v_r \leftarrow \bigl(u_r - (v_1 + m_1 (v_2 + m_2 (v_3 + \cdots + m_{r-2} v_{r-1}) \cdots ))\bigr)\, c_{1r} \cdots c_{(r-1)r} \bmod m_r.$$

If the formulas are rewritten in this way, we see that only r - 1 constants
$C_j = c_{1j} \cdots c_{(j-1)j} \bmod m_j$ are needed instead of r(r - 1)/2 constants $c_{ij}$ as in (24). Discuss
the relative merits of this version of the formula as compared to (24), from the standpoint
of computer calculation.


8. [M21] Prove that the number u defined by (24) and (25) satisfies (26). 


9. [M20] Show how to go from the values $v_1$, ..., $v_r$ of the mixed-radix notation (25)
back to the original residues $u_1$, ..., $u_r$, using only arithmetic mod $m_j$ to compute the
value of $u_j$.


10. [M25] An integer u that lies in the symmetrical range (10) might be represented
by finding the numbers $u_1$, ..., $u_r$ such that $u \equiv u_j$ (modulo $m_j$) and $-m_j/2 <
u_j < m_j/2$, instead of insisting that $0 \le u_j < m_j$ as in the text. Discuss the modular
arithmetic procedures that would be appropriate in connection with such a symmetrical
representation (including the conversion process, (24)).


11. [M23] Assume that all the $m_j$ are odd, and that $u = (u_1, \ldots, u_r)$ is known to be
even, where $0 \le u < m$. Find a reasonably fast method to compute u/2 using modular
arithmetic.


12. [M10] Prove that, if $0 \le u, v < m$, the modular addition of u and v causes overflow
(lies outside the range allowed by the modular representation) if and only if the sum
is less than u. (Thus the overflow detection problem is equivalent to the comparison
problem.)


> 13. [M25] (Automorphic numbers.) An n-digit decimal number x > 1 is called an
“automorph” by recreational mathematicians if the last n digits of $x^2$ are equal to x.
For example, 9376 is a 4-digit automorph, since $9376^2 = 87909376$. [See Scientific
American 218,1 (January 1968), 125.]


a) Prove that an n-digit number x > 1 is an automorph if and only if $x \bmod 5^n$ =
0 or 1 and $x \bmod 2^n$ = 1 or 0, respectively. (Thus, if $m_1 = 2^n$ and $m_2 = 5^n$, the
only two n-digit automorphs are the numbers $M_1$ and $M_2$ in (7).)
b) Prove that if x is an n-digit automorph, then $(3x^2 - 2x^3) \bmod 10^{2n}$ is a 2n-digit
automorph.
c) Given that $cx \equiv 1$ (modulo y), find a simple formula for a number c′ depending
on c and x but not on y, such that $c' x^2 \equiv 1$ (modulo $y^2$).

> 14. [M30] (Mersenne multiplication.) The cyclic convolution of $(x_0, x_1, \ldots, x_{n-1})$ and
$(y_0, y_1, \ldots, y_{n-1})$ is defined to be $(z_0, z_1, \ldots, z_{n-1})$, where

$$z_k = \sum_{i+j \equiv k \ (\text{modulo } n)} x_i y_j, \qquad \text{for } 0 \le k < n.$$

We will study efficient algorithms for cyclic convolution in Sections 4.3.3 and 4.6.4.
Consider q-bit integers u and v that are represented in the form

$$u = \sum_{k=0}^{n-1} u_k 2^{\lfloor kq/n \rfloor}, \qquad v = \sum_{k=0}^{n-1} v_k 2^{\lfloor kq/n \rfloor},$$

where $0 \le u_k, v_k < 2^{\lfloor (k+1)q/n \rfloor - \lfloor kq/n \rfloor}$. (This representation is a mixture of radices
$2^{\lfloor q/n \rfloor}$ and $2^{\lceil q/n \rceil}$.) Suggest a good way to find the representation of

$$w = (uv) \bmod (2^q - 1),$$

using an appropriate cyclic convolution. [Hint: Do not be afraid of floating point
arithmetic.]


*4.3.3. How Fast Can We Multiply? 


The conventional procedure for multiplication in positional number systems, Al- 
gorithm 4.3.1M, requires approximately cmn operations to multiply an m-place 
number by an n-place number, where c is a constant. In this section, let us 
assume for convenience that m = n, and let us consider the following question: 
Does every general computer algorithm for multiplying two n-place numbers
require an execution time proportional to $n^2$, as n increases?

(In this question, a “general” algorithm means one that accepts, as input, 
the number n and two arbitrary n-place numbers in positional notation; the 
algorithm is supposed to output their product in positional form. Certainly if 
we were allowed to choose a different algorithm for each value of n, the question 
would be of no interest, since multiplication could be done for any specific value 
of n by a “table-lookup” operation in some huge table. The term “computer 
algorithm” is meant to imply an algorithm that is suitable for implementation 
on a digital computer like MIX, and the execution time is to be the time it takes 
to perform the algorithm on such a computer.) 


A. Digital methods. The answer to the question above is, rather surprisingly,
“No,” and, in fact, it is not very difficult to see why. For convenience, let
us assume throughout this section that we are working with integers expressed
in binary notation. If we have two 2n-bit numbers $u = (u_{2n-1} \ldots u_1 u_0)_2$ and
$v = (v_{2n-1} \ldots v_1 v_0)_2$, we can write

$$u = 2^n U_1 + U_0, \qquad v = 2^n V_1 + V_0, \tag{1}$$

where $U_1 = (u_{2n-1} \ldots u_n)_2$ is the “most significant half” of the number u and
$U_0 = (u_{n-1} \ldots u_0)_2$ is the “least significant half”; similarly $V_1 = (v_{2n-1} \ldots v_n)_2$
and $V_0 = (v_{n-1} \ldots v_0)_2$. Now we have

$$uv = (2^{2n} + 2^n)\,U_1 V_1 + 2^n (U_1 - U_0)(V_0 - V_1) + (2^n + 1)\,U_0 V_0. \tag{2}$$

This formula reduces the problem of multiplying 2n-bit numbers to three multiplications
of n-bit numbers, namely $U_1 V_1$, $(U_1 - U_0)(V_0 - V_1)$, and $U_0 V_0$, plus
some simple shifting and adding operations.

Formula (2) can be used to multiply double-precision inputs when we want
a quadruple-precision result, and it will be just a little faster than the traditional
method on many machines. But the main advantage of (2) is that we can use
it to define a recursive process for multiplication that is significantly faster than
the familiar order-$n^2$ method when n is large: If T(n) is the time required to
perform multiplication of n-bit numbers, we have

$$T(2n) \le 3T(n) + cn \tag{3}$$

for some constant c, since the right-hand side of (2) uses just three multiplications
plus some additions and shifts. Relation (3) implies by induction that

$$T(2^k) \le c\,(3^k - 2^k), \qquad k \ge 1, \tag{4}$$

if we choose c to be large enough so that this inequality is valid when k = 1;
therefore we have

$$T(n) \le T(2^{\lceil \lg n \rceil}) \le c\,(3^{\lceil \lg n \rceil} - 2^{\lceil \lg n \rceil}) < 3c \cdot 3^{\lg n} = 3c\,n^{\lg 3}. \tag{5}$$

Relation (5) shows that the running time for multiplication can be reduced from
order $n^2$ to order $n^{\lg 3} \approx n^{1.585}$, so the recursive method is much faster than the
traditional method when n is large. Exercise 18 discusses an implementation of
this approach.
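
The recursion defined by formula (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of exercise 18: the function name `multiply`, the 16-bit cutoff for the base case, and the use of Python's built-in integers for shifting and adding are all choices made here for brevity.

```python
def multiply(u: int, v: int, n: int) -> int:
    """Multiply nonnegative integers u, v < 2**(2*n) by formula (2)."""
    if n <= 16:                        # assumed cutoff: use the built-in product
        return u * v
    mask = (1 << n) - 1
    U1, U0 = u >> n, u & mask          # most and least significant halves, as in (1)
    V1, V0 = v >> n, v & mask
    half = (n + 1) // 2                # each half has n bits, treated as 2*half bits
    high = multiply(U1, V1, half)      # U1 * V1
    low = multiply(U0, V0, half)       # U0 * V0
    a, b = U1 - U0, V0 - V1            # either difference may be negative
    mid = multiply(abs(a), abs(b), half)
    if (a < 0) != (b < 0):             # restore the sign of (U1 - U0)(V0 - V1)
        mid = -mid
    # (2^{2n} + 2^n) U1V1 + 2^n (U1 - U0)(V0 - V1) + (2^n + 1) U0V0
    return (high << (2 * n)) + ((high + mid + low) << n) + low
```

For example, `multiply(u, v, 200)` handles any two operands of at most 400 bits; only three recursive products are formed at each level, which is what relation (3) counts.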

(A similar but slightly more complicated method for doing multiplication
with running time of order $n^{\lg 3}$ was apparently first suggested by A. Karatsuba
in Doklady Akad. Nauk SSSR 145 (1962), 293-294 [English translation in Soviet 
Physics—Doklady 7 (1963), 595-596]. Curiously, this idea does not seem to 
have been discovered before 1962; none of the “calculating prodigies” who have 
become famous for their ability to multiply large numbers mentally have been 
reported to use any such method, although formula (2) adapted to decimal 
notation would seem to lead to a reasonably easy way to multiply eight-digit 
numbers in one’s head.) 

The running time can be reduced still further, in the limit as n approaches
infinity, if we observe that the method just used is essentially the special case
r = 1 of a more general method that yields

$$T((r+1)n) \le (2r+1)\,T(n) + cn \tag{6}$$

for any fixed r. This more general method can be obtained as follows: Let

$$u = (u_{(r+1)n-1} \ldots u_1 u_0)_2 \quad\text{and}\quad v = (v_{(r+1)n-1} \ldots v_1 v_0)_2$$

be broken into r + 1 pieces,

$$u = U_r 2^{rn} + \cdots + U_1 2^n + U_0, \qquad v = V_r 2^{rn} + \cdots + V_1 2^n + V_0, \tag{7}$$

where each $U_j$ and each $V_j$ is an n-bit number. Consider the polynomials

$$U(x) = U_r x^r + \cdots + U_1 x + U_0, \qquad V(x) = V_r x^r + \cdots + V_1 x + V_0, \tag{8}$$

and let

$$W(x) = U(x)V(x) = W_{2r} x^{2r} + \cdots + W_1 x + W_0. \tag{9}$$

Since $u = U(2^n)$ and $v = V(2^n)$, we have $uv = W(2^n)$, so we can easily compute
uv if we know the coefficients of W(x). The problem is to find a good way
to compute the coefficients of W(x) by using only 2r + 1 multiplications of n-bit
numbers plus some further operations that involve only an execution time
proportional to n. This can be done by computing

$$U(0)V(0) = W(0), \quad U(1)V(1) = W(1), \quad \ldots, \quad U(2r)V(2r) = W(2r). \tag{10}$$

The coefficients of a polynomial of degree 2r can be written as a linear combination
of the values of that polynomial at 2r + 1 distinct points; computing
such a linear combination requires an execution time at most proportional to n.
(Actually, the products U(j)V(j) are not strictly products of n-bit numbers,
but they are products of at most (n + t)-bit numbers, where t is a fixed value
depending on r. It is easy to design a multiplication routine for (n + t)-bit
numbers that requires only $T(n) + c_1 n$ operations, where T(n) is the number of
operations needed for n-bit multiplications, since two products of t-bit by n-bit
numbers can be done in $c_2 n$ operations when t is fixed.) Therefore we obtain a
method of multiplication satisfying (6).

Relation (6) implies that $T(n) \le c_3 n^{\log_{r+1}(2r+1)} < c_3 n^{1 + \log_{r+1} 2}$, if we argue
as in the derivation of (5), so we have now proved the following result:


Theorem A. Given ε > 0, there exists a multiplication algorithm such that the
number of elementary operations T(n) needed to multiply two n-bit numbers
satisfies

$$T(n) < c(\epsilon)\, n^{1+\epsilon}, \tag{11}$$

for some constant c(ε) independent of n. |


This theorem is still not the result we are after. It is unsatisfactory for
practical purposes because the method becomes quite complicated as ε → 0
(and therefore as r → ∞), causing c(ε) to grow so rapidly that extremely huge
values of n are needed before we have any significant improvement over (5). And 
it is unsatisfactory for theoretical purposes because it does not make use of the 
full power of the polynomial method on which it is based. We can obtain a better 
result if we let r vary with n, choosing larger and larger values of r as n increases. 
This idea is due to A. L. Toom [Doklady Akad. Nauk SSSR 150 (1963), 496- 
498, English translation in Soviet Mathematics 4 (1963), 714-716], who used it 
to show that computer circuitry for the multiplication of n-bit numbers can be
constructed with a fairly small number of components as n grows. S. A. Cook
[On the Minimum Computation Time of Functions (Thesis, Harvard University, 
1966), 51-77] showed later that Toom’s method can be adapted to fast computer 
programs. 

Before we discuss the Toom-Cook algorithm any further, let us study a small
example of the transition from U(x) and V(x) to the coefficients of W(x). This
example will not demonstrate the efficiency of the method, since the numbers
are too small, but it reveals some useful simplifications that we can make in the
general case. Suppose that we want to multiply u = 1234 times v = 2341; in
binary notation this is

$$u = (0100\ 1101\ 0010)_2 \quad\text{times}\quad v = (1001\ 0010\ 0101)_2. \tag{12}$$

Let r = 2; the polynomials U(x) and V(x) in (8) are

$$U(x) = 4x^2 + 13x + 2, \qquad V(x) = 9x^2 + 2x + 5.$$

Hence we find, for W(x) = U(x)V(x),

U(0) =  2,  U(1) =  19,  U(2) =   44,  U(3) =   77,  U(4) =   118;
V(0) =  5,  V(1) =  16,  V(2) =   45,  V(3) =   92,  V(4) =   157;
W(0) = 10,  W(1) = 304,  W(2) = 1980,  W(3) = 7084,  W(4) = 18526.        (13)


Our job is to compute the five coefficients of W (x) from the latter five values. 


An attractive little algorithm can be used to compute the coefficients of a
polynomial $W(x) = W_{m-1} x^{m-1} + \cdots + W_1 x + W_0$ when the values W(0), W(1),
..., W(m - 1) are given. Let us first write

$$W(x) = a_{m-1} x^{\underline{m-1}} + a_{m-2} x^{\underline{m-2}} + \cdots + a_1 x^{\underline{1}} + a_0, \tag{14}$$

where $x^{\underline{k}} = x(x-1)\ldots(x-k+1)$, and where the coefficients $a_j$ are unknown.
The falling factorial powers have the important property that

$$W(x+1) - W(x) = (m-1)\,a_{m-1} x^{\underline{m-2}} + (m-2)\,a_{m-2} x^{\underline{m-3}} + \cdots + a_1;$$

hence by induction we find that, for all k ≥ 0,

$$\frac{1}{k!}\Bigl(W(x+k) - \binom{k}{1} W(x+k-1) + \binom{k}{2} W(x+k-2) - \cdots + (-1)^k W(x)\Bigr)$$
$$\qquad = \binom{m-1}{k} a_{m-1} x^{\underline{m-1-k}} + \binom{m-2}{k} a_{m-2} x^{\underline{m-2-k}} + \cdots + \binom{k}{k} a_k. \tag{15}$$

Denoting the left-hand side of (15) by $(1/k!)\,\Delta^k W(x)$, we see that

$$\frac{1}{k!}\,\Delta^k W(x) = \frac{1}{k}\Bigl(\frac{1}{(k-1)!}\,\Delta^{k-1} W(x+1) - \frac{1}{(k-1)!}\,\Delta^{k-1} W(x)\Bigr)$$

and $(1/k!)\,\Delta^k W(0) = a_k$. So the coefficients $a_j$ can be evaluated using a very
simple method, illustrated here for the polynomial W(x) in (13):


   10
            294
  304               1382/2 =  691
           1676                      1023/3 = 341
 1980               3428/2 = 1714                    144/4 = 36        (16)
           5104                      1455/3 = 485
 7084               6338/2 = 3169
          11442
18526


The leftmost column of this tableau is a listing of the given values of W(0),
W(1), ..., W(4); the kth succeeding column is obtained by computing the
difference between successive values of the preceding column and dividing by k.
The coefficients $a_j$ appear at the top of the columns, so that $a_0 = 10$, $a_1 = 294$,
..., $a_4 = 36$, and we have

$$W(x) = 36x^{\underline{4}} + 341x^{\underline{3}} + 691x^{\underline{2}} + 294x^{\underline{1}} + 10$$
$$\qquad = (((36(x-3) + 341)(x-2) + 691)(x-1) + 294)\,x + 10. \tag{17}$$

In general, we can write

$$W(x) = (\ldots((a_{m-1}(x-m+2) + a_{m-2})(x-m+3) + a_{m-3})(x-m+4) + \cdots + a_1)\,x + a_0,$$


and this formula shows how the coefficients $W_{m-1}$, ..., $W_1$, $W_0$ can be obtained
from the a’s:

 36     341
       -3·36
       ------
 36     233      691
       -2·36    -2·233
       ----------------
 36     161      225      294                                          (18)
       -1·36    -1·161   -1·225
       --------------------------
 36     125       64       69      10

Here the numbers below the horizontal lines successively show the coefficients of
the polynomials

$$a_{m-1},$$
$$a_{m-1}(x-m+2) + a_{m-2},$$
$$(a_{m-1}(x-m+2) + a_{m-2})(x-m+3) + a_{m-3}, \quad\text{etc.}$$

From this tableau we have

$$W(x) = 36x^4 + 125x^3 + 64x^2 + 69x + 10, \tag{19}$$

so the answer to our original problem is 1234 · 2341 = W(16) = 2888794, where
W(16) is obtained by adding and shifting. A generalization of this method for
obtaining coefficients is discussed in Section 4.6.4.
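
The worked example in (12) through (19) can be replayed with a short Python sketch. The function name `toom_example` and its argument layout (coefficient lists with the least significant piece first, plus the piece width in bits) are assumptions made for this illustration; the two inner loops mirror tableaux (16) and (18).

```python
def toom_example(u_pieces, v_pieces, piece_bits):
    """Multiply by evaluating at 0..2r and interpolating, as in (13)-(19).

    u_pieces and v_pieces hold U_0..U_r and V_0..V_r (least significant first).
    Returns the coefficients W_0..W_2r and the product W(2**piece_bits).
    """
    r = len(u_pieces) - 1

    def evaluate(coeffs, x):           # Horner's rule
        acc = 0
        for c in reversed(coeffs):
            acc = acc * x + c
        return acc

    # W(0), W(1), ..., W(2r), as in (13)
    W = [evaluate(u_pieces, j) * evaluate(v_pieces, j) for j in range(2 * r + 1)]
    # Tableau (16): repeated differences divided by j; afterwards W[k] = a_k
    for j in range(1, 2 * r + 1):
        for t in range(2 * r, j - 1, -1):
            W[t] = (W[t] - W[t - 1]) // j
    # Tableau (18): convert falling-factorial coefficients to ordinary ones
    for j in range(2 * r - 1, 0, -1):
        for t in range(j, 2 * r):
            W[t] -= j * W[t + 1]
    product = sum(c << (piece_bits * i) for i, c in enumerate(W))
    return W, product
```

Calling `toom_example([2, 13, 4], [5, 2, 9], 4)` reproduces the coefficients 10, 69, 64, 125, 36 of (19) and the product 2888794.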
The basic Stirling number identity of Eq. 1.2.6-(45),

$$x^n = {n \brace n} x^{\underline{n}} + \cdots + {n \brace 1} x^{\underline{1}} + {n \brace 0},$$

shows that if the coefficients of W(x) are nonnegative, so are the numbers $a_j$,
and in such a case all of the intermediate results in the computation above are
nonnegative. This further simplifies the Toom-Cook multiplication algorithm,
which we will now consider in detail. (Impatient readers should, however, skip 
to subsection C below.) 


Algorithm T (High-precision multiplication of binary numbers). Given a positive
integer n and two nonnegative n-bit integers u and v, this algorithm forms
their 2n-bit product, w. Four auxiliary stacks are used to hold the long numbers
that are manipulated during the procedure:


Stacks U, V: Temporary storage of U(j) and V(j) in step T4.
Stack C: Numbers to be multiplied, and control codes.
Stack W: Storage of W(j).


These stacks may contain either binary numbers or special control symbols called
code-1, code-2, and code-3. The algorithm also constructs an auxiliary table of
numbers $q_k$, $r_k$; this table is maintained in such a manner that it may be stored
as a linear list, where a single pointer that traverses the list (moving back and
forth) can be used to access the current table entry of interest.

(Stacks C and W are used to control the recursive mechanism of this multi- 
plication algorithm in a reasonably straightforward manner that is a special case 
of general procedures discussed in Chapter 8.) 


T1. [Compute q, r tables.] Set stacks U, V, C, and W empty. Set
    k ← 1, $q_0 ← q_1 ← 16$, $r_0 ← r_1 ← 4$, Q ← 4, R ← 2.
    Now if $q_{k-1} + q_k < n$, set
    k ← k + 1, Q ← Q + R, R ← ⌊√Q⌋, $q_k ← 2^Q$, $r_k ← 2^R$,
    and repeat this operation until $q_{k-1} + q_k \ge n$. (Note: The calculation of
    R ← ⌊√Q⌋ does not require a square root to be taken, since we may simply
    set R ← R + 1 if $(R+1)^2 \le Q$ and leave R unchanged if $(R+1)^2 > Q$;
    see exercise 2. In this step we build the sequences

    k     =  0     1     2     3     4      5      6     ...
    q_k   =  2^4   2^4   2^6   2^8   2^10   2^13   2^16  ...
    r_k   =  2^2   2^2   2^2   2^2   2^3    2^3    2^4   ...

    The multiplication of 70000-bit numbers would cause this step to terminate
    with k = 6, since $70000 < 2^{13} + 2^{16}$.)
T2. [Put u, v on stack.] Put code-1 on stack C, then place u and v onto stack C
    as numbers of exactly $q_{k-1} + q_k$ bits each.
T3. [Check recursion level.] Decrease k by 1. If k = 0, the top of stack C now
    contains two 32-bit numbers, u and v; remove them, set w ← uv using
    a built-in routine for multiplying 32-bit numbers, and go to step T10. If
    k > 0, set $r ← r_k$, $q ← q_k$, $p ← q_{k-1} + q_k$, and go on to step T4.

[Fig. 8. The Toom-Cook algorithm for high-precision multiplication. Flowchart of steps T1 through T10: T3 branches to T10 when k = 0 and to T4 when k > 0; T10 returns to T6 on code-3, to T7 on code-2, and terminates on code-1.]

T4. [Break into r + 1 parts.] Let the number at the top of stack C be regarded
    as a list of r + 1 numbers with q bits each, $(U_r \ldots U_1 U_0)_{2^q}$. (The top of
    stack C now contains an $(r+1)q = (q_k + q_{k+1})$-bit number.) For j = 0, 1,
    ..., 2r, compute the p-bit numbers

    $$(\ldots(U_r j + U_{r-1})\,j + \cdots + U_1)\,j + U_0 = U(j)$$

    and successively put these values onto stack U. (The bottom of stack U
    now contains U(0), then comes U(1), etc., with U(2r) on top. We have

    $$U(j) \le U(2r) < 2^q\bigl((2r)^r + (2r)^{r-1} + \cdots + 1\bigr) < 2^{q+1}(2r)^r \le 2^p,$$

    by exercise 3.) Then remove $U_r \ldots U_1 U_0$ from stack C.
    Now the top of stack C contains another list of r + 1 q-bit numbers,
    $V_r \ldots V_1 V_0$, and the p-bit numbers

    $$(\ldots(V_r j + V_{r-1})\,j + \cdots + V_1)\,j + V_0 = V(j)$$

    should be put onto stack V in the same way. After this has been done,
    remove $V_r \ldots V_1 V_0$ from stack C.

T5. [Recurse.] Successively put the following items onto stack C, at the same
    time emptying stacks U and V:
    code-2, V(2r), U(2r), code-3, V(2r - 1), U(2r - 1), ...,
    code-3, V(1), U(1), code-3, V(0), U(0).
    Go back to step T3.

T6. [Save one product.] (At this point the multiplication algorithm has set w
    to one of the products W(j) = U(j)V(j).) Put w onto stack W. (This
    number w contains $2(q_k + q_{k-1})$ bits.) Go back to step T3.

T7. [Find a’s.] Set $r ← r_k$, $q ← q_k$, $p ← q_{k-1} + q_k$. (At this point stack W
    contains a sequence of numbers ending with W(0), W(1), ..., W(2r) from
    bottom to top, where each W(j) is a 2p-bit number.)
    Now for j = 1, 2, 3, ..., 2r, perform the following loop: For t = 2r,
    2r - 1, 2r - 2, ..., j, set W(t) ← (W(t) - W(t - 1))/j. (Here j must
    increase and t must decrease. The quantity (W(t) - W(t - 1))/j will always
    be a nonnegative integer that fits in 2p bits; see (16).)

T8. [Find W’s.] For j = 2r - 1, 2r - 2, ..., 1, perform the following loop: For
    t = j, j + 1, ..., 2r - 1, set W(t) ← W(t) - jW(t + 1). (Here j must
    decrease and t must increase. The result of this operation will again be a
    nonnegative 2p-bit integer; see (18).)

T9. [Set answer.] Set w to the $2(q_k + q_{k+1})$-bit integer

    $$(\ldots(W(2r)\,2^q + W(2r-1))\,2^q + \cdots + W(1))\,2^q + W(0).$$

    Remove W(2r), ..., W(0) from stack W.

T10. [Return.] Set k ← k + 1. Remove the top of stack C. If it is code-3, go to
    step T6. If it is code-2, put w onto stack W and go to step T7. And if it
    is code-1, terminate the algorithm (w is the answer). |
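
Step T1 is easy to check in isolation. The following Python sketch builds the $q_k$ and $r_k$ tables exactly as described there, including the square-root-free update of R; the function name `qr_tables` and the use of Python lists for the table are assumptions of this illustration, and the rest of Algorithm T (the stacks and control codes) is not modeled.

```python
def qr_tables(n):
    """Return (k, q, r) as left by step T1 for n-bit operands."""
    q, r = [16, 16], [4, 4]            # q_0 = q_1 = 16, r_0 = r_1 = 4
    k, Q, R = 1, 4, 2
    while q[k - 1] + q[k] < n:
        k += 1
        Q += R
        if (R + 1) ** 2 <= Q:          # R <- floor(sqrt(Q)) without a square root
            R += 1
        q.append(1 << Q)               # q_k = 2^Q
        r.append(1 << R)               # r_k = 2^R
    return k, q, r
```

For n = 70000 the loop stops with k = 6, and the lists agree with the table of sequences shown in step T1.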


Let us now estimate the running time, T(n), for Algorithm T, in terms
of some things we shall call “cycles,” that is, elementary machine operations.
Step T1 takes $O(q_k)$ cycles, even if we represent the number $q_k$ internally as a
long string of $q_k$ bits followed by some delimiter, since $q_k + q_{k-1} + \cdots + q_0$ will
be $O(q_k)$. Step T2 obviously takes $O(q_k)$ cycles.

Now let $t_k$ denote the amount of computation required to get from step T3
to step T10 for a particular value of k (after k has been decreased at the
beginning of step T3). Step T3 requires O(q) cycles at most. Step T4 involves r
multiplications of q-bit numbers by (lg 2r)-bit numbers, and r additions of p-bit
numbers, all repeated 4r + 2 times. Thus we need a total of $O(r^2 q \log r)$ cycles.
Step T5 requires moving 4r + 2 p-bit numbers, so it involves O(rq) cycles. Step T6
requires O(q) cycles, and it is done 2r + 1 times per iteration. The recursion
involved when the algorithm essentially invokes itself (by returning to step T3)
requires $t_{k-1}$ cycles, 2r + 1 times. Step T7 requires $O(r^2)$ subtractions of p-bit
numbers and divisions of 2p-bit by (lg 2r)-bit numbers, so it requires $O(r^2 q \log r)$
cycles. Similarly, step T8 requires $O(r^2 q \log r)$ cycles. Step T9 involves O(rq)
cycles, and T10 takes hardly any time at all.

Summing up, we have $T(n) = O(q_k) + O(q_k) + t_{k-1}$, where (if $q = q_k$ and
$r = r_k$) the main contribution to the running time satisfies

$$t_k = O(q) + O(r^2 q \log r) + O(rq) + (2r+1)\,O(q) + O(r^2 q \log r)$$
$$\qquad + O(r^2 q \log r) + O(rq) + O(q) + (2r+1)\,t_{k-1}$$
$$\qquad = O(r^2 q \log r) + (2r+1)\,t_{k-1}.$$

Thus there is a constant c such that

$$t_k \le c\,q_k r_k^2 \lg r_k + (2r_k + 1)\,t_{k-1}.$$

To complete the estimation of $t_k$ we can prove by brute force that

$$t_k \le C\,q_{k+1}\, 2^{2.5\sqrt{\lg q_{k+1}}} \tag{20}$$

for some constant C. Let us choose C > 20c, and let us also take C large enough
so that (20) is valid for $k \le k_0$, where $k_0$ will be specified below. Then when
$k > k_0$, let $Q_k = \lg q_k$, $R_k = \lg r_k$; we have by induction

$$t_k \le c\,q_k r_k^2 \lg r_k + (2r_k + 1)\,C\,q_k 2^{2.5\sqrt{Q_k}} = C\,q_{k+1} 2^{2.5\sqrt{\lg q_{k+1}}}\,(\eta_1 + \eta_2),$$

where

$$\eta_1 = \frac{c}{C}\,R_k\, 2^{R_k - 2.5\sqrt{Q_{k+1}}} < \frac{1}{20} = 0.05,$$
$$\eta_2 = \Bigl(2 + \frac{1}{r_k}\Bigr) 2^{2.5(\sqrt{Q_k} - \sqrt{Q_{k+1}})} \to 2^{-1/4} < 0.85,$$

since

$$\sqrt{Q_{k+1}} - \sqrt{Q_k} = \sqrt{Q_k + \lfloor\sqrt{Q_k}\rfloor} - \sqrt{Q_k} \to \tfrac{1}{2}$$

as k → ∞. It follows that we can find $k_0$ such that $\eta_2 < 0.95$ for all $k > k_0$, and
this completes the proof of (20) by induction.

Finally, therefore, we are ready to estimate T(n). Since $n > q_{k-1} + q_{k-2}$,
we have $q_{k-1} < n$; hence

$$r_{k-1} = 2^{\lfloor\sqrt{\lg q_{k-1}}\rfloor} < 2^{\sqrt{\lg n}}, \qquad\text{and}\qquad q_k = r_{k-1} q_{k-1} < n\,2^{\sqrt{\lg n}}.$$

Thus

$$t_{k-1} \le C\,q_k 2^{2.5\sqrt{\lg q_k}} < C\,n\,2^{\sqrt{\lg n} + 2.5(\sqrt{\lg n} + 1)},$$

and, since $T(n) = O(q_k) + t_{k-1}$, we have derived the following theorem:


Theorem B. There is a constant $c_0$ such that the execution time of Algorithm T
is less than $c_0\, n\, 2^{3.5\sqrt{\lg n}}$ cycles. |


Since $n\,2^{3.5\sqrt{\lg n}} = n^{1 + 3.5/\sqrt{\lg n}}$, this result is noticeably stronger than Theorem A.
By adding a few complications to the algorithm, pushing the ideas to
their apparent limits (see exercise 5), we can improve the estimated execution
time to

$$T(n) = O(n\,2^{\sqrt{2\lg n}} \log n). \tag{21}$$


*B. A modular method. There is another way to multiply large numbers very 
rapidly, based on the ideas of modular arithmetic as presented in Section 4.3.2. 
It is very hard to believe at first that this method can be of advantage, since a 
multiplication algorithm based on modular arithmetic must include the choice of 
moduli and the conversion of numbers into and out of modular representation, 
besides the actual multiplication operation itself. In spite of these formidable
difficulties, A. Schönhage discovered that all of these operations can be carried
out quite rapidly.

In order to understand the essential mechanism of Schönhage’s method, we
shall look at a special case. Consider the sequence defined by the rules

$$q_0 = 1, \qquad q_{k+1} = 3q_k - 1, \tag{22}$$

so that $q_k = 3^k - 3^{k-1} - \cdots - 1 = \tfrac{1}{2}(3^k + 1)$. We will study a procedure
that multiplies $p_k$-bit numbers, where $p_k = (18q_k + 8)$, in terms of a method
for multiplying $p_{k-1}$-bit numbers. Thus, if we know how to multiply numbers
having $p_0 = 26$ bits, the procedure to be described will show us how to multiply
numbers of $p_1 = 44$ bits, then 98 bits, then 260 bits, etc., eventually increasing
the number of bits by almost a factor of 3 at each step.

When multiplying $p_k$-bit numbers, the idea is to use the six moduli

$$m_1 = 2^{6q_k-1} - 1, \qquad m_2 = 2^{6q_k+1} - 1, \qquad m_3 = 2^{6q_k+2} - 1,$$
$$m_4 = 2^{6q_k+3} - 1, \qquad m_5 = 2^{6q_k+5} - 1, \qquad m_6 = 2^{6q_k+7} - 1. \tag{23}$$

These moduli are relatively prime, by Eq. 4.3.2-(19), since the exponents

$$6q_k - 1, \quad 6q_k + 1, \quad 6q_k + 2, \quad 6q_k + 3, \quad 6q_k + 5, \quad 6q_k + 7 \tag{24}$$

are always relatively prime (see exercise 6). The six moduli in (23) are capable
of representing numbers up to $m = m_1 m_2 m_3 m_4 m_5 m_6 > 2^{36q_k+16} = 2^{2p_k}$, so
there is no chance of overflow in the multiplication of $p_k$-bit numbers u and v.
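
The claims about (22), (23), and (24) can be spot-checked numerically for small k. The sketch below is only a sanity check, not a proof (that is exercise 6); the function names `schoenhage_parameters` and `moduli_are_adequate` are invented for this illustration.

```python
from itertools import combinations
from math import gcd


def schoenhage_parameters(k):
    """Return (q_k, p_k, exponents of the six moduli) from (22)-(24)."""
    q = 1
    for _ in range(k):
        q = 3 * q - 1                  # q_{k+1} = 3 q_k - 1
    p = 18 * q + 8                     # p_k = 18 q_k + 8
    exponents = [6 * q + d for d in (-1, 1, 2, 3, 5, 7)]
    return q, p, exponents


def moduli_are_adequate(k):
    """Check pairwise coprime exponents and m > 2^(2 p_k) for this k."""
    q, p, exponents = schoenhage_parameters(k)
    coprime = all(gcd(a, b) == 1 for a, b in combinations(exponents, 2))
    m = 1
    for e in exponents:
        m *= (1 << e) - 1              # product of the moduli 2^e - 1
    return coprime and m > (1 << (2 * p))
```

The values of $p_k$ for k = 0, 1, 2, 3 come out as 26, 44, 98, 260, matching the text.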


Thus we can use the following method, when k > 0:


a) Compute $u_1 = u \bmod m_1$, ..., $u_6 = u \bmod m_6$; and $v_1 = v \bmod m_1$, ...,
$v_6 = v \bmod m_6$.


b) Multiply $u_1$ by $v_1$, $u_2$ by $v_2$, ..., $u_6$ by $v_6$. These are numbers of at most
$6q_k + 7 = 18q_{k-1} + 1 < p_{k-1}$ bits, so the multiplications can be performed
by using the assumed $p_{k-1}$-bit multiplication procedure.


c) Compute $w_1 = u_1 v_1 \bmod m_1$, $w_2 = u_2 v_2 \bmod m_2$, ..., $w_6 = u_6 v_6 \bmod m_6$.
d) Compute w such that $0 \le w < m$, $w \bmod m_1 = w_1$, ..., $w \bmod m_6 = w_6$.


Let $t_k$ be the amount of time needed for this process. It is not hard to see that
operation (a) takes $O(p_k)$ cycles, since the determination of $u \bmod (2^e - 1)$ is quite
simple (like “casting out nines”), as shown in Section 4.3.2. Similarly, operation
(c) takes $O(p_k)$ cycles. Operation (b) requires essentially $6t_{k-1}$ cycles. This
leaves us with operation (d), which seems to be quite a difficult computation;
but Schönhage has found an ingenious way to perform step (d) in $O(p_k \log p_k)$
cycles, and this is the crux of the method. As a consequence, we have

$$t_k = 6t_{k-1} + O(p_k \log p_k).$$

Since $p_k = 3^{k+2} + 17$, we can show that the time for n-bit multiplication is

$$T(n) = O(n^{\log_3 6}) = O(n^{1.631}). \tag{25}$$

(See exercise 7.)
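
The “casting out nines” reduction used in operations (a) and (c) can be sketched as follows. Because $2^e \equiv 1$ (modulo $2^e - 1$), the e-bit blocks of u may simply be added together. The function name `mod_mersenne` is an assumption of this illustration, and Python integers stand in for the multiword shifts and additions a real implementation would use.

```python
def mod_mersenne(u, e):
    """Return u mod (2**e - 1) for u >= 0 and e >= 1, using shifts and adds only."""
    m = (1 << e) - 1
    while u > m:
        # 2^e is congruent to 1, so the high part folds onto the low part
        u = (u & m) + (u >> e)
    return 0 if u == m else u          # m itself represents the residue 0
```

Each pass strictly decreases u while preserving its residue, so the loop terminates with a value in the range 0 through m.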

Although the modular method is more complicated than the $O(n^{\lg 3})$ procedure
discussed at the beginning of this section, Eq. (25) shows that it does,
in fact, lead to an execution time substantially better than $O(n^2)$ for the multiplication
of n-bit numbers. Thus we have seen how to improve on the classical
method by using either of two completely different approaches.

Let us now analyze operation (d) above. Assume that we are given a set of
positive integers $e_1 < e_2 < \cdots < e_r$, relatively prime in pairs; let

$$m_1 = 2^{e_1} - 1, \qquad m_2 = 2^{e_2} - 1, \qquad \ldots, \qquad m_r = 2^{e_r} - 1. \tag{26}$$

We are also given numbers $w_1$, ..., $w_r$ such that $0 \le w_j < m_j$. Our job is to
determine the binary representation of the number w that satisfies the conditions

$$0 \le w < m_1 m_2 \ldots m_r,$$
$$w \equiv w_1 \ (\text{modulo } m_1), \qquad \ldots, \qquad w \equiv w_r \ (\text{modulo } m_r). \tag{27}$$

The method is based on (24) and (25) of Section 4.3.2. First we compute

$$w'_j = (\ldots((w_j - w'_1)\,c_{1j} - w'_2)\,c_{2j} - \cdots - w'_{j-1})\,c_{(j-1)j} \bmod m_j, \tag{28}$$

for j = 2, ..., r, where $w'_1 = w_1 \bmod m_1$; then we compute

$$w = (\ldots(w'_r m_{r-1} + w'_{r-1})\,m_{r-2} + \cdots + w'_2)\,m_1 + w'_1. \tag{29}$$

Here $c_{ij}$ is a number such that $c_{ij} m_i \equiv 1$ (modulo $m_j$); these numbers $c_{ij}$ are
not given, they must be determined from the $e_j$’s.

The calculation of (28) for all j involves $\binom{r}{2}$ additions modulo $m_j$, each
of which takes $O(e_r)$ cycles, plus $\binom{r}{2}$ multiplications by $c_{ij}$, modulo $m_j$. The
calculation of w by formula (29) involves r additions and r multiplications by $m_j$;
it is easy to multiply by $m_j$, since this is just adding, shifting, and subtracting,
so it is clear that the evaluation of Eq. (29) takes $O(r^2 e_r)$ cycles. We will soon
see that each of the multiplications by $c_{ij}$, modulo $m_j$, requires only $O(e_r \log e_r)$
cycles, and therefore it is possible to complete the entire job of conversion in
$O(r^2 e_r \log e_r)$ cycles.
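
Formulas (28) and (29) translate directly into code. The sketch below is illustrative only: the function name `convert_from_residues` is invented here, and the constants $c_{ij}$ are obtained from Python's built-in modular inverse `pow(x, -1, m)` (available in Python 3.8 and later) rather than by the fast special-purpose multiplication the text goes on to develop, so it does not achieve the stated running time.

```python
def convert_from_residues(residues, exponents):
    """Recover w from w_j = w mod (2**e_j - 1), following (28) and (29).

    The exponents are assumed to be pairwise relatively prime and at least 2.
    """
    moduli = [(1 << e) - 1 for e in exponents]
    r = len(moduli)
    digits = []                                    # w'_1, ..., w'_r
    for j in range(r):
        t = residues[j] % moduli[j]
        for i in range(j):
            c_ij = pow(moduli[i], -1, moduli[j])   # c_ij * m_i = 1 (modulo m_j)
            t = (t - digits[i]) * c_ij % moduli[j]
        digits.append(t)
    # Formula (29): w = (...(w'_r m_{r-1} + w'_{r-1}) m_{r-2} + ... + w'_2) m_1 + w'_1
    w = digits[r - 1]
    for j in range(r - 2, -1, -1):
        w = w * moduli[j] + digits[j]
    return w
```

With the six exponents 5, 7, 8, 9, 11, 13 of the case k = 0, any w below the product of the moduli is recovered from its six residues.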

These observations leave us with the following problem to solve: Given
relatively prime positive integers e and f with e < f, and a nonnegative integer
$u < 2^f$, compute the value of $(cu) \bmod (2^f - 1)$, where c is the number such
that $(2^e - 1)c \equiv 1$ (modulo $2^f - 1$); this entire computation must be done in
O(f log f) cycles. The result of exercise 4.3.2-6 gives a formula for c that suggests
a suitable procedure. First we find the least positive integer b such that

$$be \equiv 1 \ (\text{modulo } f). \tag{30}$$

Euclid’s algorithm will discover b in $O((\log f)^3)$ cycles, since it requires O(log f)
iterations when applied to e and f, and each iteration requires $O((\log f)^2)$ cycles.
Alternatively, we could be very sloppy here without violating the total time
constraint, by simply trying b = 1, 2, etc., until (30) is satisfied; such a process
would take O(f log f) cycles in all. Once b has been found, exercise 4.3.2-6 tells
us that

$$c = c[b] = \Bigl(\sum_{0 \le j < b} 2^{je}\Bigr) \bmod (2^f - 1). \tag{31}$$


A brute-force multiplication of $(cu) \bmod (2^f - 1)$ would not be good enough
to solve the problem, since we do not know how to multiply general f-bit numbers
in O(f log f) cycles. But the special form of c provides a clue: The binary
representation of c is composed of bits in a regular pattern, and Eq. (31) shows
that the number c[2b] can be obtained in a simple way from c[b]. This suggests
that we can rapidly multiply a number u by c[b] if we build c[b]u up in lg b steps
in a suitably clever manner, such as the following: Suppose b is

$$b = (b_s \ldots b_2 b_1 b_0)_2$$

in binary notation; we can calculate four sequences $a_k$, $d_k$, $u_k$, $v_k$ defined by

$$a_0 = e, \qquad a_k = 2a_{k-1} \bmod f;$$
$$d_0 = b_0 e, \qquad d_k = (d_{k-1} + b_k a_k) \bmod f;$$
$$u_0 = u, \qquad u_k = (u_{k-1} + 2^{a_{k-1}} u_{k-1}) \bmod (2^f - 1); \tag{32}$$
$$v_0 = b_0 u, \qquad v_k = (v_{k-1} + b_k 2^{d_{k-1}} u_k) \bmod (2^f - 1).$$

It is easy to prove by induction on k that

$$a_k = (2^k e) \bmod f; \qquad u_k = (c[2^k]\,u) \bmod (2^f - 1);$$
$$d_k = ((b_k \ldots b_1 b_0)_2\, e) \bmod f; \qquad v_k = (c[(b_k \ldots b_1 b_0)_2]\,u) \bmod (2^f - 1). \tag{33}$$

Hence the desired result, $(c[b]u) \bmod (2^f - 1)$, is $v_s$. The calculation of $a_k$, $d_k$,
$u_k$, and $v_k$ from $a_{k-1}$, $d_{k-1}$, $u_{k-1}$, $v_{k-1}$ takes O(log f) + O(log f) + O(f) +
O(f) = O(f) cycles; consequently the entire calculation can be done in s · O(f) =
O(f log f) cycles as desired.

The reader will find it instructive to study the ingenious method represented 
by (32) and (33) very carefully. Similar techniques are discussed in Section 4.6.3. 
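
The recurrences (32) translate almost line for line into code. The following Python sketch is an illustration only and is not part of the original text: the name `mul_by_c` is hypothetical, Python's built-in big integers stand in for the $f$-bit registers, and `pow(e, -1, f)` (Python 3.8 or later) is used in place of Euclid's algorithm to find $b$. It makes no attempt to meet the $O(f \log f)$ cycle bound; shifts and reductions are simply left to the interpreter.

```python
def mul_by_c(u, e, f):
    """Return (c*u) mod (2**f - 1), where (2**e - 1)*c ≡ 1 (mod 2**f - 1).

    Assumes 0 < e < f with gcd(e, f) = 1, as in the text.
    """
    M = (1 << f) - 1
    b = pow(e, -1, f)                       # least positive b with b*e ≡ 1 (mod f)
    bits = [(b >> k) & 1 for k in range(b.bit_length())]   # b_0, b_1, ..., b_s

    a = e % f                               # a_0
    d = (bits[0] * e) % f                   # d_0
    uk = u % M                              # u_0
    vk = (bits[0] * u) % M                  # v_0
    for bk in bits[1:]:
        uk = (uk + (uk << a)) % M           # u_k, using a_{k-1}
        a = (2 * a) % f                     # a_k
        vk = (vk + bk * (uk << d)) % M      # v_k, using d_{k-1} and the new u_k
        d = (d + bk * a) % f                # d_k, using the new a_k
    return vk                               # v_s = (c[b] u) mod (2^f - 1)
```

For instance, with $e = 3$ and $f = 7$ the result can be compared against a direct computation of $c$ as the inverse of $2^3 - 1$ modulo $2^7 - 1$.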

Schönhage’s paper [Computing 1 (1966), 182-196] shows that these ideas 
can be extended to the multiplication of $n$-bit numbers using $r \approx 2^{\sqrt{2\lg n}}$ moduli,
obtaining a method analogous to Algorithm T. We shall not dwell on the details 
here, since Algorithm T is always superior; in fact, an even better method is 
next on our agenda. 


C. Discrete Fourier transforms. The critical problem in high-precision 
multiplication is the determination of “convolution products” such as 
$$u_r v_0 + u_{r-1} v_1 + \cdots + u_0 v_r, \tag{34}$$


and there is an intimate relation between convolutions and an important mathematical
concept called “Fourier transformation.” If $\omega = \exp(2\pi i/K)$ is a $K$th
root of unity, the one-dimensional Fourier transform of the sequence of complex
numbers $(u_0, u_1, \ldots, u_{K-1})$ is defined to be the sequence $(\hat u_0, \hat u_1, \ldots, \hat u_{K-1})$,
where

$$\hat u_s = \sum_{0 \le t < K} \omega^{st} u_t, \qquad 0 \le s < K. \tag{35}$$

Letting $(\hat v_0, \hat v_1, \ldots, \hat v_{K-1})$ be defined in the same way, as the Fourier transform
of $(v_0, v_1, \ldots, v_{K-1})$, it is not difficult to see that $(\hat u_0 \hat v_0, \hat u_1 \hat v_1, \ldots, \hat u_{K-1} \hat v_{K-1})$ is
the transform of $(w_0, w_1, \ldots, w_{K-1})$, where

$$\begin{aligned}
w_r &= u_r v_0 + u_{r-1} v_1 + \cdots + u_0 v_r + u_{K-1} v_{r+1} + \cdots + u_{r+1} v_{K-1}\\
&= \sum_{i+j \equiv r \ (\text{modulo } K)} u_i v_j.
\end{aligned} \tag{36}$$

When $K \ge 2n - 1$ and $u_n = u_{n+1} = \cdots = u_{K-1} = v_n = v_{n+1} = \cdots =
v_{K-1} = 0$, the $w$’s are just what we need for multiplication, since the terms
$u_{K-1}v_{r+1} + \cdots + u_{r+1}v_{K-1}$ vanish when $0 \le r \le 2n - 2$. In other words, the
transform of a convolution product is the ordinary product of the transforms.
This idea is actually a special case of Toom’s use of polynomials (see (10)), with 
x replaced by roots of unity. 
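
The relation between (35) and (36) is easy to check numerically. The sketch below is an illustration, not part of the original text: the function names `dft` and `cyclic_convolution` are hypothetical, the transform is evaluated directly from its definition in $O(K^2)$ operations, and ordinary floating-point complex numbers are assumed to be accurate enough for these small inputs.

```python
import cmath

def dft(u):
    """Transform (35): û_s = sum over t of ω^(s t) u_t, with ω = exp(2πi/K)."""
    K = len(u)
    w = cmath.exp(2j * cmath.pi / K)
    return [sum(w ** (s * t) * u[t] for t in range(K)) for s in range(K)]

def cyclic_convolution(u, v):
    """Right-hand side of (36): w_r = sum of u_i v_j over i + j ≡ r (mod K)."""
    K = len(u)
    return [sum(u[i] * v[(r - i) % K] for i in range(K)) for r in range(K)]

u = [2, 13, 4, 0, 0, 0, 0, 0]
v = [5, 2, 9, 0, 0, 0, 0, 0]
lhs = [a * b for a, b in zip(dft(u), dft(v))]   # product of the transforms
rhs = dft(cyclic_convolution(u, v))             # transform of the convolution
assert all(abs(x - y) < 1e-9 for x, y in zip(lhs, rhs))
```

Because the upper halves of both sequences are zero, the cyclic convolution here coincides with the ordinary convolution products (34).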

If K is a power of 2, the discrete Fourier transform (35) can be obtained quite 
rapidly when the computations are arranged in a certain way, and so can the 
inverse transform (determining the $w$’s from the $\hat w$’s). This property of Fourier
transforms was exploited by V. Strassen in 1968, who discovered how to multiply 
large numbers faster than was possible under all previously known schemes. He 
and A. Schönhage later refined the method and published improved procedures
in Computing 7 (1971), 281-292. Similar ideas, but with all-integer methods, 
had been worked out independently by J. M. Pollard [Math. Comp. 25 (1971), 
365-374]. In order to understand their approach to the problem, let us first take 
a look at the mechanism of fast Fourier transforms. 


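Given a sequence of $K = 2^k$ complex numbers $(u_0, \ldots, u_{K-1})$, and given the
complex number

$$\omega = \exp(2\pi i/K), \tag{37}$$

the sequence $(\hat u_0, \ldots, \hat u_{K-1})$ defined in (35) can be calculated rapidly by carrying
out the following scheme. (In these formulas the parameters $s_j$ and $t_j$ are either
0 or 1, so that each “pass” represents $2^k$ elementary computations.)

Pass 0. Let $A^{[0]}(t_{k-1}, \ldots, t_0) = u_t$, where $t = (t_{k-1} \ldots t_0)_2$.
Pass 1. Set $A^{[1]}(s_{k-1}, t_{k-2}, \ldots, t_0) \leftarrow
A^{[0]}(0, t_{k-2}, \ldots, t_0) + \omega^{2^{k-1}s_{k-1}} A^{[0]}(1, t_{k-2}, \ldots, t_0)$.
Pass 2. Set $A^{[2]}(s_{k-1}, s_{k-2}, t_{k-3}, \ldots, t_0) \leftarrow
A^{[1]}(s_{k-1}, 0, t_{k-3}, \ldots, t_0) + \omega^{2^{k-2}(s_{k-2}s_{k-1})_2} A^{[1]}(s_{k-1}, 1, t_{k-3}, \ldots, t_0)$.


Pass k. Set $A^{[k]}(s_{k-1}, \ldots, s_1, s_0) \leftarrow
A^{[k-1]}(s_{k-1}, \ldots, s_1, 0) + \omega^{(s_0 s_1 \ldots s_{k-1})_2} A^{[k-1]}(s_{k-1}, \ldots, s_1, 1)$.


It is fairly easy to prove by induction that we have

$$A^{[j]}(s_{k-1}, \ldots, s_{k-j}, t_{k-j-1}, \ldots, t_0) = \sum_{0 \le t_{k-1}, \ldots, t_{k-j} \le 1} \omega^{2^{k-j}(s_{k-j} \ldots s_{k-1})_2 (t_{k-1} \ldots t_{k-j})_2}\, u_t, \tag{38}$$

where $t = (t_{k-1} \ldots t_1 t_0)_2$, so that

$$A^{[k]}(s_{k-1}, \ldots, s_1, s_0) = \hat u_s, \qquad \text{where } s = (s_0 s_1 \ldots s_{k-1})_2. \tag{39}$$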


(It is important to notice that the binary digits of s are reversed in the final 
result (39). Section 4.6.4 contains further discussion of transforms such as this.) 
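
The passes above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the text's own program: the names `fft_by_passes` and `bit_reverse` are hypothetical, the array $A^{[j]}$ is stored as a list indexed by the $k$-bit integer formed from its arguments, and floating-point complex arithmetic replaces the fixed point arithmetic analyzed later in the section.

```python
import cmath

def bit_reverse(i, k):
    """Reverse the k-bit binary representation of i (k >= 1)."""
    return int(format(i, "0%db" % k)[::-1], 2)

def fft_by_passes(u):
    """Carry out passes 1..k on a list of K = 2**k numbers (K >= 2).

    After pass k, position (s_{k-1} ... s_0)_2 holds û_s with
    s = (s_0 ... s_{k-1})_2, which is the bit reversal noted in (39).
    """
    K = len(u)
    k = K.bit_length() - 1
    assert K == 1 << k and k >= 1, "K must be a power of 2"
    w = cmath.exp(2j * cmath.pi / K)
    A = list(u)                                    # pass 0
    for j in range(1, k + 1):
        p = k - j                                  # bit position that turns from t into s
        B = [0] * K
        for idx in range(K):
            # (s_{k-j} ... s_{k-1})_2: bits p..k-1 of idx, read in reverse order
            rev = bit_reverse(idx >> p, j)
            B[idx] = A[idx & ~(1 << p)] + w ** (rev << p) * A[idx | (1 << p)]
        A = B
    return A
```

To recover the transform in natural order, read $\hat u_s$ from position `bit_reverse(s, k)` of the returned list.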
To get the inverse Fourier transform $(u_0, \ldots, u_{K-1})$ from the values of
$(\hat u_0, \ldots, \hat u_{K-1})$, notice that the “double transform” is

$$\begin{aligned}
\hat{\hat u}_r &= \sum_{0 \le s < K} \omega^{rs} \hat u_s = \sum_{0 \le s,t < K} \omega^{rs} \omega^{st} u_t\\
&= \sum_{0 \le t < K} u_t \Bigl(\sum_{0 \le s < K} \omega^{s(t+r)}\Bigr) = K u_{(-r) \bmod K},
\end{aligned} \tag{40}$$

since the geometric series $\sum_{0 \le s < K} \omega^{sj}$ sums to zero unless $j$ is a multiple of $K$.
Therefore the inverse transform can be computed in the same way as the trans- 
form itself, except that the final results must be divided by K and shuffled 
slightly. 
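
A minimal sketch of this inversion, assuming a direct $O(K^2)$ evaluation of (35) and floating-point complex arithmetic; the names `dft` and `inverse_dft` are hypothetical and serve only to illustrate relation (40).

```python
import cmath

def dft(u):
    """Transform (35), evaluated directly from the definition."""
    K = len(u)
    w = cmath.exp(2j * cmath.pi / K)
    return [sum(w ** (s * t) * u[t] for t in range(K)) for s in range(K)]

def inverse_dft(u_hat):
    """Apply the forward transform again, then divide by K and reindex as in (40)."""
    K = len(u_hat)
    double = dft(u_hat)                      # double[r] = K * u[(-r) mod K]
    return [double[(-t) % K] / K for t in range(K)]
```

The "shuffle" is the index map $t \mapsto (-t) \bmod K$, which leaves positions 0 and $K/2$ fixed and swaps the others in pairs.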

Returning to the problem of integer multiplication, suppose we wish to 
compute the product of two n-bit integers u and v. As in Algorithm T we 
shall work with groups of bits; let 


$$2n \le 2^k l < 4n, \qquad K = 2^k, \qquad L = 2^l, \tag{41}$$

and write

$$u = (U_{K-1} \ldots U_1 U_0)_L, \qquad v = (V_{K-1} \ldots V_1 V_0)_L, \tag{42}$$

regarding $u$ and $v$ as $K$-place numbers in radix $L$ so that each digit $U_j$ or $V_j$ is
an $l$-bit integer. Actually the leading digits $U_j$ and $V_j$ are zero for all $j \ge K/2$,
because $2^{k-1} l \ge n$. We will select appropriate values for $k$ and $l$ later; at the
moment our goal is to see what happens in general, so that we can choose $k$
and $l$ intelligently when all the facts are before us.

The next step of the multiplication procedure is to compute the Fourier
transforms $(\hat u_0, \ldots, \hat u_{K-1})$ and $(\hat v_0, \ldots, \hat v_{K-1})$ of the sequences $(u_0, \ldots, u_{K-1})$
and $(v_0, \ldots, v_{K-1})$, where we define

$$u_t = U_t/2^{k+l}, \qquad v_t = V_t/2^{k+l}. \tag{43}$$

This scaling is done for convenience so that each $u_t$ and $v_t$ is less than $2^{-k}$,
ensuring that the absolute values $|\hat u_s|$ and $|\hat v_s|$ will be less than 1 for all $s$.

An obvious problem arises here, since the complex number $\omega$ can’t be
represented exactly in binary notation. How are we going to compute a reliable 
Fourier transform? By a stroke of good luck, it turns out that everything will 
work properly if we do the calculations with only a modest amount of precision. 
For the moment let us bypass this question and assume that infinite-precision 
calculations are being performed; we shall analyze later how much accuracy is 
actually needed. 

Once the $\hat u_s$ and $\hat v_s$ have been found, we let $\hat w_s = \hat u_s \hat v_s$ for $0 \le s < K$ and
determine the inverse Fourier transform $(w_0, \ldots, w_{K-1})$. As explained above,
we now have

$$w_r = \sum_{i+j=r} u_i v_j = \sum_{i+j=r} U_i V_j / 2^{2k+2l},$$
Table 1
MULTIPLICATION VIA DISCRETE FOURIER TRANSFORMATION

$s$   $2^7\hat u_s$             $2^7\hat v_s$             $2^{14}\hat w_s$                              $2^{14}\hat{\hat w}_s$   $2^{14}w_s = W_s$
0     $19$                      $16$                      $304$                                         80      10
1     $2+4i+13\omega$           $5+9i+2\omega$            $-26+64i+69\omega-125\bar\omega$              0       69
2     $-2+13i$                  $-4+2i$                   $-18-56i$                                     0       64
3     $2-4i-13\bar\omega$       $5-9i-2\bar\omega$        $-26-64i+125\omega-69\bar\omega$              0       125
4     $-7$                      $12$                      $-84$                                         288     36
5     $2+4i-13\omega$           $5+9i-2\omega$            $-26+64i-69\omega+125\bar\omega$              1000    0
6     $-2-13i$                  $-4-2i$                   $-18+56i$                                     512     0
7     $2-4i+13\bar\omega$       $5-9i+2\bar\omega$        $-26-64i-125\omega+69\bar\omega$              552     0


so the integers $W_r = 2^{2k+2l} w_r$ are the coefficients in the desired product

$$uv = W_{K-2} L^{K-2} + \cdots + W_1 L + W_0. \tag{44}$$

Since $0 \le W_r < (r+1)L^2 < KL^2$, each $W_r$ has at most $k + 2l$ bits, so it will
not be difficult to compute the binary representation when the $W$’s are known
unless $k$ is large compared to $l$.

For example, suppose we want to multiply $u = 1234$ times $v = 2341$ when
the parameters are $k = 3$ and $l = 4$. The computation of $(\hat u_0, \ldots, \hat u_7)$ from $u$
proceeds as follows (see (12)):

$(r,s,t) =$              (0,0,0)  (0,0,1)  (0,1,0)   (0,1,1)    (1,0,0)          (1,0,1)          (1,1,0)                  (1,1,1)
$2^7A^{[0]}(r,s,t) =$    2        13       4         0          0                0                0                        0
$2^7A^{[1]}(r,s,t) =$    2        13       4         0          2                13               4                        0
$2^7A^{[2]}(r,s,t) =$    6        13       $-2$      13         $2+4i$           13               $2-4i$                   13
$2^7A^{[3]}(r,s,t) =$    19       $-7$     $-2+13i$  $-2-13i$   $\alpha+\beta$   $\alpha-\beta$   $\bar\alpha-\bar\beta$   $\bar\alpha+\bar\beta$

Here $\alpha = 2 + 4i$, $\beta = 13\omega$, and $\omega = (1 + i)/\sqrt2$; this gives us the column headed
$2^7\hat u_s$ in Table 1. The column for $\hat v_s$ is obtained from $v$ in the same way; then
we multiply $\hat u_s$ by $\hat v_s$ to get $\hat w_s$. Transforming again gives us $w_s$ and $W_s$, using
relation (40). Once again we obtain the convolution products in (19), this time
using complex numbers instead of sticking to an all-integer method.
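
The whole example can be replayed in code. The following Python sketch is an illustration under stated assumptions, not the procedure as analyzed in the text: it uses a direct $O(K^2)$ transform and double-precision floating point rather than fast transforms with $m$-bit fixed point fractions, and the names `dft` and `multiply_via_dft` are hypothetical.

```python
import cmath

def dft(a):
    """Transform (35), evaluated directly from the definition."""
    K = len(a)
    w = cmath.exp(2j * cmath.pi / K)
    return [sum(w ** (s * t) * a[t] for t in range(K)) for s in range(K)]

def multiply_via_dft(u, v, k, l):
    """Multiply nonnegative integers u, v < 2**(2**(k-1) * l), following (41)-(44)."""
    K, L = 1 << k, 1 << l
    scale = 1 << (k + l)
    U = [(u >> (l * t)) & (L - 1) for t in range(K)]      # radix-L digits of u
    V = [(v >> (l * t)) & (L - 1) for t in range(K)]      # radix-L digits of v
    u_hat = dft([x / scale for x in U])                   # scaling (43)
    v_hat = dft([x / scale for x in V])
    w_hat = [a * b for a, b in zip(u_hat, v_hat)]
    double = dft(w_hat)                                   # K * w[(-r) mod K], by (40)
    W = [round((double[(-r) % K] / K).real * scale * scale) for r in range(K)]
    return W, sum(W[r] << (l * r) for r in range(K))      # evaluate (44)

W, product = multiply_via_dft(1234, 2341, 3, 4)
```

With the parameters of the example, `W` reproduces the last column of Table 1 and `product` equals $1234 \times 2341$.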

Let us try to estimate how much time this method takes on large numbers, 
if m-bit fixed point arithmetic is used in calculating the Fourier transforms. 
Exercise 10 shows that all of the quantities $A^{[j]}$ during all the passes of the
transform calculations will be less than 1 in magnitude because of the scaling
(43), hence it suffices to deal with $m$-bit fractions $(.a_{-1} \ldots a_{-m})_2$ for the real and
imaginary parts of all the intermediate quantities. Simplifications are possible
because the inputs $u_t$ and $v_t$ are real-valued; only $K$ real values instead of $2K$
need to be carried in each step (see exercise 4.6.4–14). We will ignore such
refinements in order to keep complications to a minimum. 

The first job is to compute $\omega$ and its powers. For simplicity we shall make
a table of the values $\omega^0$, ..., $\omega^{K-1}$. Let

$$\omega_r = \exp(2\pi i/2^r), \tag{45}$$

so that $\omega_1 = -1$, $\omega_2 = i$, $\omega_3 = (1+i)/\sqrt2$, ..., $\omega_k = \omega$. If $\omega_r = x_r + iy_r$ and
$r \ge 2$, we have $\omega_{r+1} = x_{r+1} + iy_{r+1}$ where

$$x_{r+1} = \sqrt{\frac{1 + x_r}{2}}, \qquad y_{r+1} = \frac{y_r}{2x_{r+1}}. \tag{46}$$

[See S. R. Tate, IEEE Transactions SP-43 (1995), 1709-1711.] The calculation
of $\omega_1$, $\omega_2$, ..., $\omega_k$ takes negligible time compared with the other computations
we need, so we can use any straightforward algorithm for square roots. Once the
$\omega_r$ have been calculated we can compute all of the powers $\omega^j$ by noting that

$$\omega^j = \omega_1^{j_{k-1}} \ldots \omega_{k-1}^{j_1} \omega_k^{j_0} \qquad \text{if } j = (j_{k-1} \ldots j_1 j_0)_2. \tag{47}$$

This method of calculation keeps errors from propagating, since each $\omega^j$ is a
product of at most $k$ of the $\omega_r$’s. The total time to calculate all the $\omega^j$ is
$O(KM)$, where $M$ is the time to do an $m$-bit complex multiplication, because
only one multiplication is needed to obtain each $\omega^j$ from a previously computed
value. The subsequent steps will require more than $O(KM)$ cycles, so the powers
of $\omega$ have been computed at negligible cost.
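
The table construction can be sketched as follows. This is an illustrative Python version under stated assumptions: the name `omega_table` is hypothetical, and floating-point arithmetic stands in for the $m$-bit fixed point calculation with truncation that the text goes on to analyze.

```python
import math

def omega_table(k):
    """Return [ω^0, ..., ω^(K-1)] for K = 2**k, built from (46) and (47)."""
    # ω_1 = -1 and ω_2 = i; the remaining ω_r come from the recurrence (46).
    om = [None, complex(-1, 0), complex(0, 1)]
    for r in range(2, k):
        x, y = om[r].real, om[r].imag
        x_next = math.sqrt((1 + x) / 2)
        om.append(complex(x_next, y / (2 * x_next)))
    K = 1 << k
    table = [complex(1, 0)] * K
    for j in range(1, K):
        t = (j & -j).bit_length() - 1          # position of the lowest 1 bit of j
        # One multiplication per entry: ω^j = ω^(j - 2^t) * ω_(k-t), as in (47).
        table[j] = table[j & (j - 1)] * om[k - t]
    return table
```

Each entry is obtained from an earlier one by clearing the lowest 1 bit of $j$, so every $\omega^j$ is a product of at most $k$ of the $\omega_r$, as the text requires.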

Each of the three Fourier transformations comprises $k$ passes, each of which
involves $K$ operations of the form $a \leftarrow b + \omega^j c$, so the total time to calculate the
Fourier transforms is

$$O(kKM) = O(Mnk/l).$$

Finally, the work involved in computing the binary digits of $u \cdot v$ using (44) is
$O(K(k+l)) = O(n + nk/l)$. Summing over all operations, we find that the total
time to multiply $n$-bit numbers $u$ and $v$ will be $O(n) + O(Mnk/l)$.

Now let’s see how large the intermediate precision $m$ needs to be, so that
we know how large $M$ needs to be. For simplicity we shall be content with safe
estimates of the accuracy, instead of finding the best possible bounds. It will be
convenient to compute all the $\omega^j$ in such a way that our approximations $(\omega^j)'$ will
satisfy $|(\omega^j)'| \le 1$; this condition is easy to guarantee if we truncate towards zero
instead of rounding, because $x_{r+1}^2 + y_{r+1}^2 = (1 + x_r^2 + y_r^2 + 2x_r)/(2 + 2x_r)$ in (46).
The operations we need to perform with $m$-bit fixed point complex arithmetic
are all obtained by replacing an exact computation of the form $a \leftarrow b + \omega^j c$ by
the approximate computation

$$a' \leftarrow \text{truncate}(b' + (\omega^j)'c'), \tag{48}$$

where $b'$, $(\omega^j)'$, and $c'$ are previously computed approximations; all of these
complex numbers and their approximations are bounded by 1 in absolute value.
If $|b' - b| \le \delta_1$, $|(\omega^j)' - \omega^j| \le \delta_2$, and $|c' - c| \le \delta_3$, it is not difficult to see that
we will have $|a' - a| < \delta + \delta_1 + \delta_2 + \delta_3$, where

$$\delta = |2^{-m} + 2^{-m}i| = 2^{1/2-m}, \tag{49}$$

because we have $|(\omega^j)'c' - \omega^jc| = |((\omega^j)' - \omega^j)c' + \omega^j(c' - c)| \le \delta_2 + \delta_3$, and
$\delta$ exceeds the maximum truncation error. The approximations $(\omega^j)'$ are obtained
by starting with approximations $\omega_r'$ to the numbers defined in (46), and we may
assume that (46) is performed with sufficient precision to make $|\omega_r' - \omega_r| < \delta$.
Then (47) implies that $|(\omega^j)' - \omega^j| < (2k - 1)\delta$ for all $j$, because the error is
due to at most $k$ approximations and $k - 1$ truncations.

If we have errors of at most $\epsilon$ before any pass of the fast Fourier transform,
the operations of that pass therefore have the form (48) where $\delta_1 = \delta_3 = \epsilon$ and
$\delta_2 = (2k - 1)\delta$; the errors after the pass will then be at most $2\epsilon + 2k\delta$. There is
no error in Pass 0, so we find by induction on $j$ that the maximum error after
Pass $j$ is bounded by $(2^j - 1) \cdot 2k\delta$, and the computed values of $\hat u_s$ will satisfy
$|(\hat u_s)' - \hat u_s| < (2^k - 1) \cdot 2k\delta$. A similar formula will hold for $(\hat v_s)'$; and we will
have

$$|(\hat w_s)' - \hat w_s| < 2(2^k - 1) \cdot 2k\delta + \delta < (4k2^k - 2k)\delta.$$

During the inverse transformation there is an additional accumulation of errors,
but the division by $K = 2^k$ ameliorates most of this; by the same argument we
find that the computed values $w_r'$ will satisfy

$$|(\hat{\hat w}_r)' - \hat{\hat w}_r| < 2^k(4k2^k - 2k)\delta + (2^k - 1)2k\delta; \qquad |w_r' - w_r| < 4k2^k\delta. \tag{50}$$


We need enough precision to make $2^{2k+2l}w_r'$ round to the correct integer $W_r$,
hence we need

$$2^{2k+2l+2+\lg k+k+1/2-m} \le \tfrac12; \tag{51}$$

that is, $m \ge 3k + 2l + \lg k + 7/2$. This will hold if we simply require that

$$k \ge 7 \qquad \text{and} \qquad m \ge 4k + 2l. \tag{52}$$

Relations (41) and (52) can be used to determine parameters $k$, $l$, $m$ so that
multiplication takes $O(n) + O(Mnk/l)$ units of time, where $M$ is the time to
multiply $m$-bit fractions.

If we are using MIX, for example, suppose we want to multiply binary numbers
having $n = 2^{13} = 8192$ bits each. We can choose $k = 11$, $l = 8$, $m = 60$,
so that the necessary $m$-bit operations are nothing more than double-precision
arithmetic. The running time $M$ needed to do fixed point $m$-bit complex multiplication
will therefore be comparatively small. With triple-precision operations
we can go up for example to $k = l = 15$, $n \le 15 \cdot 2^{14}$, which takes us way beyond
MIX’s memory capacity. On a larger machine we could multiply a pair of gigabit
numbers if we took $k = l = 27$ and $m = 144$.
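
These parameter choices can be checked mechanically against (41) and the inequality $m \ge 3k + 2l + \lg k + 7/2$ that follows from (51). The helper names below (`min_precision`, `max_bits`) are hypothetical and the snippet is only an arithmetic check, written in Python for illustration.

```python
import math

def min_precision(k, l):
    """Smallest integer m with m >= 3k + 2l + lg k + 7/2, the bound derived from (51)."""
    return math.ceil(3 * k + 2 * l + math.log2(k) + 3.5)

def max_bits(k, l):
    """Largest n allowed by (41), which requires 2n <= 2**k * l."""
    return (1 << k) * l // 2

for k, l, m in [(11, 8, 60), (15, 15, 90), (27, 27, 144)]:
    print(k, l, max_bits(k, l), min_precision(k, l), min_precision(k, l) <= m)
```

The value $m = 90$ for $k = l = 15$ is an assumption here: it is what the simpler rule (52) gives, and it corresponds to three 30-bit words.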

Further study of the choice of k, l, and m leads in fact to a rather surprising 
conclusion: For all practical purposes we can assume that M is constant, and the 
Schönhage–Strassen multiplication technique will have a running time linearly
proportional to n. The reason is that we can choose k = l and m = 6k; this 
choice of k is always less than lgn, so we will never need to use more than 
sextuple precision unless n is larger than the word size of our computer. (In 
particular, n would have to be larger than the capacity of an index register, so 
we probably couldn’t fit the numbers u and v in main memory.) 

The practical problem of fast multiplication is therefore solved, except for
improvements in the constant factor. In fact, the all-integer convolution algorithm
of exercise 4.6.4–59 is probably a better choice for practical high-precision
multiplication. Our interest in multiplying large numbers is partly theoretical,
however, because it is interesting to explore the ultimate limits of computational
complexity. So let’s forget practical considerations momentarily and suppose
that $n$ is extremely huge, perhaps much larger than the number of atoms in
the universe. We can let $m$ be approximately $6\lg n$, and use the same algorithm
recursively to do the $m$-bit multiplications. The running time will satisfy $T(n) =
O(nT(\log n))$; hence

$$T(n) \le C\,n\,(C\lg n)(C\lg\lg n)(C\lg\lg\lg n)\ldots, \tag{53}$$

where the product continues until reaching a factor with $\lg \ldots \lg n \le 2$.

Schönhage and Strassen showed how to improve this theoretical upper bound
to $O(n \log n \log\log n)$ in their paper, by using integer numbers $\omega$ to carry out
fast Fourier transforms on integers, modulo numbers of the form $2^e + 1$. This
upper bound applies to Turing machines, namely to computers with bounded
memory and a finite number of arbitrarily long tapes.

If we allow ourselves a more powerful computer, with random access to any
number of words of bounded size, Schönhage has pointed out that the upper
bound drops to $O(n \log n)$. For we can choose $k = l$ and $m = 6k$, and we
have time to build a complete multiplication table of all possible products $xy$
for $0 \le x, y < 2^{\lceil m/12 \rceil}$. (The number of such products is $2^k$ or $2^{k+1}$, and we
can compute each table entry by addition from one of its predecessors in $O(k)$
steps, hence $O(k2^k) = O(n)$ steps will suffice for the calculation.) In this case
$M$ is the time needed to do 12-place arithmetic in radix $2^{\lceil m/12 \rceil}$, and it follows
that $M = O(k) = O(\log n)$ because 1-place multiplication can be done by table
lookup. (The time to access a word of memory is assumed to be proportional to
the number of bits in the address of that word.)

Moreover, Schönhage discovered in 1979 that a pointer machine can carry
out $n$-bit multiplication in $O(n)$ steps; see exercise 12. Such devices (which are
also called “storage modification machines” and “linking automata”) seem to
provide the best models of computation when $n \to \infty$, as discussed at the end
of Section 2.6. So we can conclude that multiplication in $O(n)$ steps is possible
for theoretical purposes as well as in practice.

An unusual general-purpose computer called Little Fermat, with a special
ability to multiply large integers rapidly, was designed in 1986 by D. V.
Chudnovsky, G. V. Chudnovsky, M. M. Denneau, and S. G. Younis. Its hardware
featured fast arithmetic modulo $2^{256} + 1$ on 257-bit words; a convolution of 256-word
arrays could then be done using 256 single-word multiplications, together
with three discrete transforms that required only addition, subtraction, and
shifting. This made it possible to multiply two $10^6$-bit integers in less than
0.1 second, based on a pipelined cycle time of approximately 60 nanoseconds
[Proc. Third Int. Conf. on Supercomputing 2 (International Supercomputing
Institute, 1988), 498-499; Contemporary Math. 143 (1993), 136].


D. Division. Now that we have efficient routines for multiplication, let’s 
consider the inverse problem. It turns out that division can be performed just 
as fast as multiplication, except for a constant factor. 
To divide an $n$-bit number $u$ by an $n$-bit number $v$, we can first find an
$n$-bit approximation to $1/v$, then multiply by $u$ to get an approximation $\hat q$ to
$u/v$; finally, we can make the slight correction necessary to $\hat q$ to ensure that
$0 \le u - qv < v$ by using another multiplication. From this reasoning, we see
that it suffices to have an efficient way to approximate the reciprocal of an $n$-bit
number. The following algorithm does this, using “Newton’s method” as
explained at the end of Section 4.3.1.


Algorithm R (High-precision reciprocal). Let $v$ have the binary representation
$v = (0.v_1v_2v_3 \ldots)_2$, where $v_1 = 1$. This algorithm computes an approximation $z$
to $1/v$, such that

$$|z - 1/v| \le 2^{-n}. \tag{54}$$


R1. [Initial approximation.] Set $z \leftarrow \frac14\lfloor 32/(4v_1 + 2v_2 + v_3) \rfloor$ and $k \leftarrow 0$.


R2. [Newtonian iteration.] (At this point we have a number $z$ of the binary
form $(xx.xx \ldots x)_2$ with $2^k + 1$ places after the radix point, and $z \le 2$.)
Calculate $z^2 = (xxx.xx \ldots x)_2$ exactly, using a high-speed multiplication
routine. Then calculate $V_k z^2$ exactly, where $V_k = (0.v_1v_2 \ldots v_{2^{k+1}+3})_2$.
Then set $z \leftarrow 2z - V_k z^2 + r$, where $0 \le r < 2^{-2^{k+1}-1}$ is added if necessary
to round $z$ up so that it is a multiple of $2^{-2^{k+1}-1}$. Finally, set $k \leftarrow k + 1$.


R3. [Test for end.] If $2^k < n$, go back to step R2; otherwise the algorithm
terminates. ∎
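
Steps R1 through R3 can be sketched with exact rational arithmetic. The Python version below is an illustration under stated assumptions: the name `reciprocal` is hypothetical, `fractions.Fraction` replaces the fixed-width binary fractions and the high-speed multiplication routine, and the input is assumed to satisfy $\frac12 \le v < 1$.

```python
from fractions import Fraction
from math import floor, ceil

def reciprocal(v, n):
    """Algorithm R with exact rationals; v is a Fraction with 1/2 <= v < 1.

    Returns z with |z - 1/v| <= 2**(-n).
    """
    v = Fraction(v)
    # R1. Initial approximation from the three leading bits (v1 v2 v3)_2 = floor(8v).
    z = Fraction(32 // floor(8 * v), 4)
    k = 0
    while True:
        # R2. Newtonian iteration, with v truncated to 2^(k+1) + 3 bits.
        bits = 2 ** (k + 1) + 3
        Vk = Fraction(floor(v * 2 ** bits), 2 ** bits)
        z = 2 * z - Vk * z * z
        unit = 2 ** (2 ** (k + 1) + 1)
        z = Fraction(ceil(z * unit), unit)     # round up to a multiple of 2^(-2^(k+1)-1)
        k += 1
        # R3. Test for end.
        if 2 ** k >= n:
            return z
```

For example, `reciprocal(Fraction(3, 5), 64)` should agree with $5/3$ to within $2^{-64}$, and `reciprocal(Fraction(1, 2), n)` returns exactly 2.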


This algorithm is based on a suggestion by S. A. Cook. A similar technique 
has been used in computer hardware [see Anderson, Earle, Goldschmidt, and 
Powers, IBM J. Res. Dev. 11 (1967), 48-52]. Of course, it is necessary to check 
the accuracy of Algorithm R quite carefully, because it comes very close to being 
inaccurate. We will prove by induction that 


$$z \le 2 \qquad \text{and} \qquad |z - 1/v| \le 2^{-2^k} \tag{55}$$


at the beginning and end of step R2. 
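For this purpose, let $\delta_k = 1/v - z_k$, where $z_k$ is the value of $z$ after $k$ iterations
of step R2. To start the induction on $k$, we have

$$\delta_0 = 1/v - 8/v' + (32/v' - \lfloor 32/v' \rfloor)/4 = \eta_1 + \eta_2,$$

where $v' = (v_1v_2v_3)_2$ and $\eta_1 = (v' - 8v)/vv'$, so that we have $-\frac12 < \eta_1 \le 0$ and
$0 \le \eta_2 < \frac14$. Hence $|\delta_0| < \frac12$. Now suppose that (55) has been verified for $k$; then

$$\begin{aligned}
\delta_{k+1} = 1/v - z_{k+1} &= 1/v - z_k - z_k(1 - z_kV_k) - r\\
&= \delta_k - z_k(1 - z_kv) - z_k^2(v - V_k) - r\\
&= \delta_k - (1/v - \delta_k)v\delta_k - z_k^2(v - V_k) - r\\
&= v\delta_k^2 - z_k^2(v - V_k) - r.
\end{aligned}$$

Now $0 \le v\delta_k^2 < \delta_k^2 \le (2^{-2^k})^2 = 2^{-2^{k+1}}$ and

$$0 \le z^2(v - V_k) + r < 4(2^{-2^{k+1}-3}) + 2^{-2^{k+1}-1} = 2^{-2^{k+1}},$$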
so $|\delta_{k+1}| \le 2^{-2^{k+1}}$. We must still verify the first inequality of (55); to show that
$z_{k+1} \le 2$, there are three cases:
a) $V_k = \frac12$; then $z_{k+1} = 2$.
b) $V_k \ne \frac12 = V_{k-1}$; then $z_k = 2$, so $2z_k - z_k^2V_k \le 2 - 2^{-2^{k+1}-1}$.
c) $V_{k-1} \ne \frac12$; then $z_{k+1} = 1/v - \delta_{k+1} < 2$, since $k > 0$.
The running time of Algorithm R is bounded by

$$2T(4n) + 2T(2n) + 2T(n) + 2T(\tfrac12 n) + \cdots + O(n)$$


steps, where T(n) is an upper bound on the time needed to do a multiplication of 
n-bit numbers. If T(n) has the form n f(n) for some monotonically nondecreasing 
function f(n), we have 


$$T(4n) + T(2n) + T(n) + \cdots < T(8n), \tag{56}$$


so division can be done with a speed comparable to that of multiplication except 
for a constant factor. 

R. P. Brent has shown that functions such as $\log x$, $\exp x$, and $\arctan x$ can
be evaluated to n significant bits in O(M(n) log n) steps, if it takes M(n) units 
of time to multiply n-bit numbers [JACM 23 (1976), 242-251]. 


E. Multiplication in real time. It is natural to wonder if multiplication of 
n-bit numbers can be accomplished in just n steps. We have come from order 
$n^2$ down to order $n$, so perhaps we can squeeze the time down to the absolute
minimum. In fact, it is actually possible to output the answer as fast as we input 
the digits, if we leave the domain of conventional computer programming and 
allow ourselves to build a computer that has an unlimited number of components 
all acting at once. 

A linear iterative array of automata is a set of devices $M_1$, $M_2$, $M_3$, ...
that can each be in a finite set of “states” at each step of a computation. The
machines $M_2$, $M_3$, ... all have identical circuitry, and their state at time $t + 1$
is a function of their own state at time $t$ as well as the states of their left and
right neighbors at time $t$. The first machine $M_1$ is slightly different: Its state at
time $t + 1$ is a function of its own state and that of $M_2$, at time $t$, and also of
the input at time $t$. The output of a linear iterative array is a function defined
on the states of $M_1$.

Let $u = (u_{n-1} \ldots u_1u_0)_2$, $v = (v_{n-1} \ldots v_1v_0)_2$, and $q = (q_{n-1} \ldots q_1q_0)_2$ be
binary numbers, and let $uv + q = w = (w_{2n-1} \ldots w_1w_0)_2$. It is a remarkable
fact that a linear iterative array can be constructed, independent of $n$, that will
output $w_0$, $w_1$, $w_2$, ... at times 1, 2, 3, ..., if it is given the inputs $(u_0, v_0, q_0)$,
$(u_1, v_1, q_1)$, $(u_2, v_2, q_2)$, ... at times 0, 1, 2, ....

We can state this phenomenon in the language of computer hardware by 
saying that it is possible to design a single integrated circuit module with the fol- 
lowing property: If we wire together sufficiently many of these chips in a straight 
line, with each module communicating only with its left and right neighbors, the 
resulting circuitry will produce the 2n-bit product of n-bit numbers in exactly 
2n clock pulses. 
Table 2
MULTIPLICATION IN A LINEAR ITERATIVE ARRAY

The basic idea can be understood as follows. At time 0, machine $M_1$ senses
$(u_0, v_0, q_0)$ and it therefore is able to output $(u_0v_0 + q_0) \bmod 2$ at time 1. Then it
sees $(u_1, v_1, q_1)$ and it can output $(u_0v_1 + u_1v_0 + q_1 + k_1) \bmod 2$, where $k_1$ is the
“carry” left over from the previous step, at time 2. Next it sees $(u_2, v_2, q_2)$ and
outputs $(u_0v_2 + u_1v_1 + u_2v_0 + q_2 + k_2) \bmod 2$; furthermore, its state holds the
values of $u_2$ and $v_2$ so that machine $M_2$ will be able to sense these values at time 3,
and $M_2$ will be able to compute $u_2v_2$ for the benefit of $M_1$ at time 4. Machine
$M_1$ essentially arranges to start $M_2$ multiplying the sequence $(u_2, v_2)$, $(u_3, v_3)$,
..., and $M_2$ will ultimately give $M_3$ the job of multiplying $(u_4, v_4)$, $(u_5, v_5)$, etc.
Fortunately, things just work out so that no time is lost. The reader will find it
interesting to deduce further details from the formal description that follows.

Each automaton has $2^{11}$ states $(c, x_0, y_0, x_1, y_1, x, y, z_2, z_1, z_0)$, where
$0 \le c < 4$ and each of the $x$’s, $y$’s, and $z$’s is either 0 or 1. Initially, all the
devices are in state $(0,0,0,0,0,0,0,0,0,0)$. Suppose that a machine $M_j$, for
$j > 1$, is in state $(c, x_0, y_0, x_1, y_1, x, y, z_2, z_1, z_0)$ at time $t$, and its left neighbor
$M_{j-1}$ is in state $(c^l, x_0^l, y_0^l, x_1^l, y_1^l, x^l, y^l, z_2^l, z_1^l, z_0^l)$ while its right neighbor $M_{j+1}$
is in state $(c^r, x_0^r, y_0^r, x_1^r, y_1^r, x^r, y^r, z_2^r, z_1^r, z_0^r)$ at that time. Then machine $M_j$ will
go into state $(c', x_0', y_0', x_1', y_1', x', y', z_2', z_1', z_0')$ at time $t + 1$, where

$$\begin{aligned}
c' &= \min(c + 1, 3) && \text{if } c^l = 3, && 0 \text{ otherwise};\\
(x_0', y_0') &= (x^l, y^l) && \text{if } c = 0, && (x_0, y_0) \text{ otherwise};\\
(x_1', y_1') &= (x^l, y^l) && \text{if } c = 1, && (x_1, y_1) \text{ otherwise};\\
(x', y') &= (x^l, y^l) && \text{if } c \ge 2, && (x, y) \text{ otherwise};
\end{aligned} \tag{57}$$

and $(z_2'z_1'z_0')_2$ is the binary notation for

$$z_0^r + z_1 + z_2^l + \begin{cases}
x^ly^l, & \text{if } c = 0;\\
x_0y^l + x^ly_0, & \text{if } c = 1;\\
x_0y^l + x_1y_1 + x^ly_0, & \text{if } c = 2;\\
x_0y^l + x_1y + xy_1 + x^ly_0, & \text{if } c = 3.
\end{cases} \tag{58}$$

The leftmost machine $M_1$ behaves in almost the same way as the others; it acts
exactly as if there were a machine to its left in state $(3, 0, 0, 0, 0, u, v, q, 0, 0)$ when
it is receiving the inputs $(u, v, q)$. The output of the array is the $z_0$ component
of $M_1$.

Table 2 shows an example of this array acting on the inputs

$$u = v = (\ldots 00010111)_2, \qquad q = (\ldots 00001011)_2.$$

The output sequence appears in the lower right portion of the states of $M_1$:

0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, ...,

representing the number $(\ldots 01000011100)_2$ from right to left.
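
The transition rules (57) and (58) are easy to simulate. The Python sketch below is an illustration under stated assumptions: the names `step` and `array_multiply` are hypothetical, states are stored as 10-tuples in the order $(c, x_0, y_0, x_1, y_1, x, y, z_2, z_1, z_0)$, and the array is cut off after a generous $n + 2$ automata, with everything beyond the last one treated as permanently idle (exercise 11 asks how many are really needed).

```python
def step(states, u, v, q):
    """Advance every automaton one clock pulse according to (57) and (58)."""
    virtual_left = (3, 0, 0, 0, 0, u, v, q, 0, 0)     # what M_1 "sees" on its left
    idle = (0,) * 10                                  # stands in beyond the last automaton
    new = []
    for j, (c, x0, y0, x1, y1, x, y, z2, z1, z0) in enumerate(states):
        left = virtual_left if j == 0 else states[j - 1]
        right = states[j + 1] if j + 1 < len(states) else idle
        cl, xl, yl, z2l, z0r = left[0], left[5], left[6], left[7], right[9]
        if c == 0:
            p = xl * yl
        elif c == 1:
            p = x0 * yl + xl * y0
        elif c == 2:
            p = x0 * yl + x1 * y1 + xl * y0
        else:
            p = x0 * yl + x1 * y + x * y1 + xl * y0
        total = z0r + z1 + z2l + p                    # at most 7, so three bits suffice
        new.append((
            min(c + 1, 3) if cl == 3 else 0,
            *((xl, yl) if c == 0 else (x0, y0)),
            *((xl, yl) if c == 1 else (x1, y1)),
            *((xl, yl) if c >= 2 else (x, y)),
            total >> 2, (total >> 1) & 1, total & 1,
        ))
    return new

def array_multiply(u, v, q, n):
    """Feed n-bit u, v, q to the array and collect the 2n output bits of uv + q."""
    states = [(0,) * 10 for _ in range(n + 2)]
    w = 0
    for t in range(2 * n):
        states = step(states, (u >> t) & 1, (v >> t) & 1, (q >> t) & 1)
        w |= states[0][9] << t                        # z_0 of M_1 at time t + 1 is bit w_t
    return w
```

With the inputs of Table 2, namely $u = v = 23$ and $q = 11$, the simulation should return $23 \cdot 23 + 11 = 540 = (1000011100)_2$.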

This construction is based on a similar one first published by A. J. Atrubin, 
IEEE Trans. EC-14 (1965), 394-399. 

Fast as it is, the iterative array is optimum only when the input bits arrive
one at a time. If the input bits are all present simultaneously, we prefer parallel
circuitry that will obtain the product of two $n$-bit numbers after $O(\log n)$ levels
of delay. Efficient circuits of that kind have been described, for example, by
C. S. Wallace, IEEE Trans. EC-13 (1964), 14-17; D. E. Knuth, The Stanford
GraphBase (New York: ACM Press, 1994), 270-279.

S. Winograd [JACM 14 (1967), 793-802] has investigated the minimum 
multiplication time achievable in a logical circuit when n is given and when the 
inputs are available all at once in arbitrarily coded form. For similar questions 
when multiplication and addition must both be supported simultaneously, see 
A. C. Yao, STOC 13 (1981), 308-311; Mansour, Nisan, and Tiwari, STOC 22 
(1990), 235-243. 


Multiplication is mie vexation, 
And Division is quite as bad: 
The Golden Rule is mie stumbling stule, 
And Practice drives me mad. 


— Manuscript collected by J. O. HALLIWELL (c. 1570) 


EXERCISES 


1. [22] The idea expressed in (2) can be generalized to the decimal system, if the 
radix 2 is replaced by 10. Using this generalization, calculate 1234 times 2341 (reducing 
this product of four-digit numbers to three products of two-digit numbers, and reducing 
each of the latter to products of one-digit numbers). 


2. [M22] Prove that, in step T1 of Algorithm T, the value of $R$ either stays the same
or increases by one when we set $R \leftarrow \lfloor\sqrt{Q}\rfloor$. (Therefore, as observed in that step, we
need not calculate a square root.)


3. [M22] Prove that the sequences $q_k$ and $r_k$ defined in Algorithm T satisfy the
inequality $2^{q_k+1}(2r_k)^{r_k} \le 2^{q_{k-1}+q_k}$, when $k > 0$.


4. [28] (K. Baker.) Show that it is advantageous to evaluate the polynomial $W(x)$
at the points $x = -r$, ..., 0, ..., $r$ instead of at the nonnegative points $x = 0$, 1,
..., $2r$ as in Algorithm T. The polynomial $U(x)$ can be written

$$U(x) = U_e(x^2) + xU_o(x^2),$$

and similarly $V(x)$ and $W(x)$ can be expanded in this way; show how to exploit this
idea, obtaining faster calculations in steps T7 and T8.


5. [35] Show that if in step T1 of Algorithm T we set $R \leftarrow \lceil\sqrt{2Q}\,\rceil + 1$ instead of
setting $R \leftarrow \lfloor\sqrt{Q}\rfloor$, with suitable initial values of $q_0$, $q_1$, $r_0$, and $r_1$, then (20) can be
improved to $t_k \le q_{k+1} 2^{\sqrt{2\lg q_{k+1}}} (\lg q_{k+1})$.


6. [M23] Prove that the six numbers in (24) are relatively prime in pairs.


7. [M23] Prove (25).


8. [M20] True or false: We can ignore the bit reversal $(s_{k-1}, \ldots, s_0) \to (s_0, \ldots, s_{k-1})$
in (39), because the inverse Fourier transform will reverse the bits again anyway.


9. [M15] Suppose the Fourier transformation method of the text is applied with all
occurrences of $\omega$ replaced by $\omega^q$, where $q$ is some fixed integer. Find a simple relation
between the numbers $(\tilde u_0, \tilde u_1, \ldots, \tilde u_{K-1})$ obtained by this general procedure and the
numbers $(\hat u_0, \hat u_1, \ldots, \hat u_{K-1})$ obtained when $q = 1$.




10. [M26] The scaling in (43) makes it clear that all the complex numbers $A^{[j]}$
computed by pass $j$ of the transformation subroutine will be less than $2^{j-k}$ in absolute
value, during the calculations of $\hat u_s$ and $\hat v_s$ in the Schönhage–Strassen multiplication
algorithm. Show that all of the $A^{[j]}$ will be less than 1 in absolute value during the
third Fourier transformation (the calculation of $\hat{\hat w}_r$).


11. [M26] If $n$ is fixed, how many of the automata in the linear iterative array defined
by (57) and (58) are needed to compute the product of $n$-bit numbers? (Notice that the
automaton $M_j$ is influenced only by the component $z_0$ of the machine on its right, so
we may remove all automata whose $z_0$ component is always zero whenever the inputs
are $n$-bit numbers.)


12. [M41] (A. Schönhage.) The purpose of this exercise is to prove that a simple
form of pointer machine can multiply n-bit numbers in O(n) steps. The machine has 
no built-in facilities for arithmetic; all it does is work with nodes and pointers. Each 
node has the same finite number of link fields, and there are finitely many link registers. 
The only operations this machine can do are: 


i) read one bit of input and jump if that bit is 0; 

ii) output 0 or 1; 

iii) load a register with the contents of another register or with the contents of a 
link field in a node pointed to by a register; 

iv) store the contents of a register into a link field in a node pointed to by a register; 
v) jump if two registers are equal; 

vi) create a new node and make a register point to it; 

vii) halt. 


Implement the Fourier-transform multiplication method efficiently on such a machine. 
[Hints: First show that if $N$ is any positive integer, it is possible to create $N$ nodes
representing the integers $\{0, 1, \ldots, N - 1\}$, where the node representing $p$ has pointers
to the nodes representing $p + 1$, $\lfloor p/2 \rfloor$, and $2p$. These nodes can be created in $O(N)$
steps. Show that arithmetic with radix $N$ can now be simulated without difficulty: For
example, it takes $O(\log N)$ steps to find the node for $(p + q) \bmod N$ and to determine if
$p + q \ge N$, given pointers to $p$ and $q$; and multiplication can be simulated in $O(\log N)^2$
steps. Now consider the algorithm in the text, with $k = l$ and $m = 6k$ and $N = 2^{\lceil m/13 \rceil}$,
so that all quantities in the fixed point arithmetic calculations are 13-place integers with
radix $N$. Finally, show that each pass of the fast Fourier transformations can be done
in $O(K + (N \log N)^2) = O(K)$ steps, using the following idea: Each of the $K$ necessary
assignments can be “compiled” into a bounded list of instructions for a simulated MIX-like
computer whose word size is $N$, and instructions for $K$ such machines acting in
parallel can be simulated in $O(K + (N \log N)^2)$ steps if they are first sorted so that
all identical instructions are performed together. (Two instructions are identical if
they have the same operation code, the same register contents, and the same memory
operand contents.) Note that $N^2 = O(n^{12/13})$, so $(N \log N)^2 = O(K)$.]


13. [M25] (A. Schönhage.) What is a good upper bound on the time needed to
multiply an m-bit number by an n-bit number, when both m and n are very large but 
n is much larger than m, based on the results discussed in this section for the case 
m=n? 

14. [M42] Write a program for Algorithm T, incorporating the improvements of 
exercise 4. Compare it with a program for Algorithm 4.3.1M and with a program 
based on (2), to see how large n must be before Algorithm T is an improvement. 




15. [M49] (S. A. Cook.) A multiplication algorithm is said to be online if the (k+1)st 
input bits of the operands, from right to left, are not read until the kth output bit 
has been produced. What are the fastest possible online multiplication algorithms 
achievable on various species of automata? 


16. [25] Prove that it takes only O(K log K) arithmetic operations to evaluate the
discrete Fourier transform (35), even when K is not a power of 2. [Hint: Rewrite (35)
in the form

    $\hat u_s = \omega^{-s^2/2} \sum_{0\le t<K} \omega^{(s+t)^2/2}\, \omega^{-t^2/2}\, u_t$

and express this sum as a convolution product.]


17. [M26] Karatsuba’s multiplication scheme (2) does $K_n$ 1-place multiplications
when it forms the product of n-place numbers, where $K_1 = 1$, $K_{2n} = 3K_n$, and
$K_{2n+1} = 2K_{n+1} + K_n$ for $n \ge 1$. “Solve” this recurrence by finding an explicit formula
for $K_n$ when $n = 2^{e_1} + 2^{e_2} + \cdots + 2^{e_t}$, $e_1 > e_2 > \cdots > e_t \ge 0$.


18. [M30] Devise a scheme to allocate memory for the intermediate results when 
multiplication is performed by a recursive algorithm based on (2): Given two N-place 
integers u and v, each in N consecutive places of memory, show how to arrange the 
computation so that the product uv appears in the least significant 2N places of a 
(3N + O(log N))-place area of working storage. 


19. [M23] Show how to compute uv mod m with a bounded number of operations that 
meet the ground rules of exercise 3.2.1.1-11, if you are also allowed to test whether
one operand is less than another. Both u and v are variable, but m is constant. Hint: 
Consider the decomposition in (2). 
4.4. RADIX CONVERSION


IF OUR ANCESTORS had invented arithmetic by counting with their two fists or 
their eight fingers, instead of their ten “digits,” we would never have to worry 
about writing binary-decimal conversion routines. (And we would perhaps never 
have learned as much about number systems.) In this section, we shall discuss 
the conversion of numbers from positional notation with one radix into positional 
notation with another radix; this process is, of course, most important on binary 
computers when converting decimal input data into binary form, and converting 
binary answers into decimal form. 


A. The four basic methods. Binary-decimal conversion is one of the most 
machine-dependent operations of all, since computer designers keep inventing 
different ways to provide for it in the hardware. Therefore we shall discuss 
only the general principles involved, from which programmers can select the 
procedures that are best suited to their machines. 

We shall assume that only nonnegative numbers enter into the conversion, 
since the manipulation of signs is easily accounted for. 

Let us assume that we are converting from radix b to radix B. (Mixed- 
radix generalizations are considered in exercises 1 and 2.) Most radix-conversion 
routines are based on multiplication and division, using one of the four methods 
below. The first two methods apply to integers (radix point at the right), and the 
others to fractions (radix point at the left). It is often impossible to express a
terminating radix-b fraction $(0.u_{-1}u_{-2}\ldots u_{-m})_b$ exactly as a terminating radix-B
fraction $(0.U_{-1}U_{-2}\ldots U_{-M})_B$. For example, the fraction 1/10 has the infinite
binary representation $(0.0001100110011\ldots)_2$. Therefore methods of rounding
the result to M places are sometimes necessary.


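Method 1a (Division by B using radix-b arithmetic). Given an integer u, we
can obtain its radix-B representation $(\ldots U_2U_1U_0)_B$ as follows:

    $U_0 = u \bmod B,\quad U_1 = \lfloor u/B\rfloor \bmod B,\quad U_2 = \lfloor\lfloor u/B\rfloor/B\rfloor \bmod B,\quad \ldots,$

stopping when $\lfloor\ldots\lfloor\lfloor u/B\rfloor/B\rfloor\ldots/B\rfloor = 0$.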
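
A minimal Python sketch of Method 1a follows, assuming that Python's built-in integers play the role of the radix-b arithmetic. The name `to_radix` and the choice of returning digits least significant first are illustrative only and do not come from the text.

```python
def to_radix(u: int, B: int) -> list[int]:
    """Return the radix-B digits U_0, U_1, ... of u (least significant first)."""
    if u < 0 or B < 2:
        raise ValueError("need u >= 0 and B >= 2")
    digits = []
    while u != 0:
        u, r = divmod(u, B)   # r = u mod B, then u becomes floor(u / B)
        digits.append(r)
    return digits or [0]      # represent zero by the single digit 0
```

Reversing the returned list gives the digits in the usual left-to-right order.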


Method 1b (Multiplication by b using radix-B arithmetic). If u has the radix-b
representation $(u_m \ldots u_1u_0)_b$, we can use radix-B arithmetic to evaluate the
polynomial $u_m b^m + \cdots + u_1 b + u_0 = u$ in the form

    $((\ldots(u_m b + u_{m-1})\,b + \cdots)\,b + u_1)\,b + u_0.$
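
Method 1b is Horner's rule applied to the digit string. A small Python sketch, under the assumption that Python integers stand in for the radix-B arithmetic; `from_radix` is a hypothetical name chosen for this example:

```python
def from_radix(digits: list[int], b: int) -> int:
    """Evaluate (u_m ... u_1 u_0)_b, digits given most significant first."""
    u = 0
    for d in digits:
        if not 0 <= d < b:
            raise ValueError("digit out of range for radix b")
        u = u * b + d         # one multiplication by b and one addition per digit
    return u
```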


Method 2a (Multiplication by B using radix-b arithmetic). Given a fractional
number u, we can obtain the digits of its radix-B representation $(.U_{-1}U_{-2}\ldots)_B$
as follows:

    $U_{-1} = \lfloor uB\rfloor,\quad U_{-2} = \lfloor\{uB\}B\rfloor,\quad U_{-3} = \lfloor\{\{uB\}B\}B\rfloor,\quad \ldots,$

where $\{x\}$ denotes $x \bmod 1 = x - \lfloor x\rfloor$. If it is desired to round the result
to M places, the computation can be stopped after $U_{-M}$ has been calculated,
and $U_{-M}$ should be increased by unity if $\{\ldots\{\{uB\}B\}\ldots B\}$ is greater than 1/2.
(Note, however, that this may cause carries to propagate, and these carries must
be incorporated into the answer using radix-B arithmetic. It would be simpler to
add the constant $\frac12 B^{-M}$ to the original number u before the calculation begins,
but this may lead to an incorrect answer when $\frac12 B^{-M}$ cannot be represented
exactly as a radix-b number inside the computer. Note further that it is possible
for the answer to round up to $(1.00\ldots0)_B$, if $b^m > 2B^M$.)
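
The rounding and carry propagation just described can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive routine: exact rationals from the standard `fractions` module replace the machine's radix-b arithmetic, and the name `fraction_to_radix` and its `(carry, digits)` return convention are invented for the example.

```python
from fractions import Fraction

def fraction_to_radix(u: Fraction, B: int, M: int) -> tuple[int, list[int]]:
    """Round 0 <= u < 1 to M radix-B places.

    Returns (carry, digits): digits holds U_{-1}, ..., U_{-M}, and carry is 1
    only when the value rounds all the way up to (1.00...0)_B.
    """
    digits = []
    for _ in range(M):
        u *= B
        d = u.numerator // u.denominator   # floor(uB)
        digits.append(d)
        u -= d                             # {uB} = uB mod 1
    carry = 1 if u > Fraction(1, 2) else 0
    for i in reversed(range(M)):           # propagate the carry in radix B
        if carry == 0:
            break
        digits[i] += 1
        carry, digits[i] = divmod(digits[i], B)
    return carry, digits
```

For instance, 999/1000 rounded to two decimal places yields a carry of 1 with digits 0, 0, which is the round-up case mentioned at the end of the note.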

Exercise 3 shows how to extend this method so that M is variable, just large 
enough to represent the original number to a specified accuracy. In this case the 
problem of carries does not occur. 


Method 2b (Division by b using radix-B arithmetic). If u has the radix-b
representation $(0.u_{-1}u_{-2}\ldots u_{-m})_b$, we can use radix-B arithmetic to evaluate
$u_{-1}b^{-1} + u_{-2}b^{-2} + \cdots + u_{-m}b^{-m}$ in the form

    $((\ldots(u_{-m}/b + u_{1-m})/b + \cdots + u_{-2})/b + u_{-1})/b.$


Care should be taken to control errors that might occur due to truncation or 
rounding in the division by b; these are often negligible, but not always. 
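
A short Python sketch of Method 2b, assuming exact rational arithmetic (the standard `fractions` module) so that the truncation errors mentioned above do not arise; a fixed-precision implementation would have to bound them. The name `fraction_from_radix` is hypothetical.

```python
from fractions import Fraction

def fraction_from_radix(digits: list[int], b: int) -> Fraction:
    """Evaluate (0.u_{-1} u_{-2} ... u_{-m})_b, digits given left to right."""
    x = Fraction(0)
    for d in reversed(digits):   # start with u_{-m}, finish with u_{-1}
        x = (x + d) / b          # one addition and one division by b per digit
    return x
```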


To summarize, Methods 1a, 1b, 2a, and 2b give us two ways to convert
integers and two ways to convert fractions; and it is certainly possible to convert 
between integers and fractions by multiplying or dividing by an appropriate 
power of b or B. Therefore there are at least four methods to choose from when 
trying to do radix conversion. 


B. Single-precision conversion. To illustrate these four methods, let us 
assume that MIX is a binary computer, and suppose that we want to convert a 
nonnegative binary integer u to a decimal integer. Thus b = 2 and B = 10. 
Method 1a could be programmed as follows:


            ENT1 0           Set j ← 0.
            LDX  U
            ENTA 0           Set rAX ← u.
    1H      DIV  =10=        (rA, rX) ← (⌊rAX/10⌋, rAX mod 10).          (1)
            STX  ANSWER,1    U_j ← rX.
            INC1 1           j ← j + 1.
            SRAX 5           rAX ← rA.
            JXP  1B          Repeat until result is zero. ∎


This requires 18M + 4 cycles to obtain M digits. 

Method 1a uses division by 10; Method 2a uses multiplication by 10, so it 
might be a little faster. But in order to use Method 2a, we must deal with 
fractions, and this leads to an interesting situation. Let w be the word size of 
the computer, and assume that $u < 10^n < w$. With a single division we can find
q and r, where


    $wu = 10^n q + r, \qquad 0 \le r < 10^n.$    (2)

Now if we apply Method 2a to the fraction $(q + 1)/w$, we will obtain the digits
of u from left to right, in n steps, since


    $\left\lfloor 10^n\,\frac{q+1}{w}\right\rfloor = \left\lfloor u + \frac{10^n - r}{w}\right\rfloor = u.$    (3)


(This idea is due to P. A. Samet, Software Practice & Experience 1 (1971),
93-96.)
Here is the corresponding MIX program: 


            JOV  OFLO        Ensure that overflow is off.
    OFLO    LDA  U
            LDX  =10ⁿ=       rAX ← wu + 10ⁿ.
            DIV  =10ⁿ=       rA ← q + 1, rX ← r.
            JOV  ERROR       Jump if u ≥ 10ⁿ.
            ENT1 n-1         Set j ← n − 1.                               (4)
    2H      MUL  =10=        Now imagine the radix point at the left, rA = x.
            STA  ANSWER,1    Set U_j ← ⌊10x⌋.
            SLAX 5           x ← {10x}.
            DEC1 1           j ← j − 1.
            J1NN 2B          Repeat for n > j ≥ 0. ∎


This slightly longer routine requires 16n + 19 cycles, so it is a little faster than
program (1) if $n = M \ge 8$; when leading zeros are present, (1) will be faster.
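
The left-to-right scheme based on (2) and (3) can be simulated in Python. The sketch below is an illustration under stated assumptions: `w` is passed explicitly as the word size, the fraction $(q+1)/w$ is carried as its integer numerator, and the function name is invented for this example.

```python
def digits_left_to_right(u: int, n: int, w: int) -> list[int]:
    """Decimal digits of u, most significant first, assuming u < 10**n < w."""
    if not (0 <= u < 10**n < w):
        raise ValueError("need 0 <= u < 10**n < w")
    q, r = divmod(w * u, 10**n)   # wu = 10^n q + r, as in equation (2)
    x = q + 1                     # numerator of the fraction (q + 1)/w
    digits = []
    for _ in range(n):
        x *= 10                   # multiply the fraction by 10
        d, x = divmod(x, w)       # integer part is the next digit; keep {10x}
        digits.append(d)
    return digits
```

Leading zeros are produced explicitly, which matches the remark that program (1) is faster when they are present.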

Program (4) as it stands cannot be used to convert integers $u \ge 10^m$ when
$10^m < w < 10^{m+1}$, since we would need to take $n = m + 1$. In this case we
can obtain the leading digit of u by computing $\lfloor u/10^m\rfloor$; then $u \bmod 10^m$ can be
converted as above with $n = m$.

The fact that the answer digits are obtained from left to right may be an 
advantage in some applications (for example, when typing out an answer one 
digit at a time). Thus we see that a fractional method can be used for conversion 
of integers, although the use of inexact division makes a little bit of numerical 
analysis necessary. 

We can avoid the division by 10 in Method 1a if we do two multiplications 
instead. This alternative can be important, because radix conversion is often 
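done by “satellite” computers that have no built-in division capability. If we let
x be an approximation to 1/10, so that


    $\frac{1}{10} < x < \frac{1}{10} + \frac{1}{w},$


it is easy to prove (see exercise 7) that $\lfloor ux\rfloor = \lfloor u/10\rfloor$ or $\lfloor u/10\rfloor + 1$, so long as
$0 \le u < w$. Therefore, if we compute $u - 10\lfloor ux\rfloor$, we will be able to determine
the value of $\lfloor u/10\rfloor$:


    $\lfloor u/10\rfloor = \lfloor ux\rfloor - [\,u < 10\lfloor ux\rfloor\,].$    (5)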


At the same time we will have determined u mod 10. A MIX program for conver- 
sion using (5) appears in exercise 8; it requires about 33 cycles per digit. 
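
Here is a hedged Python sketch of division by 10 through multiplication, in the spirit of (5). It assumes the particular approximation $x = (\lfloor w/10\rfloor + 1)/w$, which exceeds 1/10 and is at most $1/10 + 1/w$; that choice, and the function name, are assumptions of this example rather than something prescribed by the text.

```python
def div10_by_multiplication(u: int, w: int) -> tuple[int, int]:
    """Return (floor(u/10), u mod 10) for 0 <= u < w without dividing by 10."""
    if not 0 <= u < w:
        raise ValueError("need 0 <= u < w")
    x_num = w // 10 + 1        # x = x_num / w is a precomputed constant
    q = (u * x_num) // w       # floor(ux): high word of a double-word product
    if u < 10 * q:             # the bracket [u < 10 floor(ux)] in (5)
        q -= 1
    return q, u - 10 * q       # the remainder u mod 10 comes for free
```

When w is a power of 2, the division by w is only a shift, so the routine uses two multiplications, a comparison, and a subtraction.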
If the computer has neither division nor multiplication in its repertoire of
built-in instructions, we can still use Method 1a for conversion by judiciously
shifting and adding, as explained in exercise 9.

Another way to convert from binary to decimal is to use Method 1b, but to 
do this we need to simulate doubling in a decimal number system. This approach 
is generally most suitable for incorporation into computer hardware; however, it 
is possible to program the doubling process for decimal numbers, using binary 
addition, binary shifting, and binary extraction or masking (bitwise AND) as 
shown in Table 1, which was suggested by Peter L. Montgomery. 


Table 1
DOUBLING A BINARY-CODED DECIMAL NUMBER

    Operation                          General form                                  Example
    1. Given number                    u11 u10 u9 u8  u7 u6 u5 u4  u3 u2 u1 u0       0011 0110 1001 = 369
    2. Add 3 to each digit             v11 v10 v9 v8  v7 v6 v5 v4  v3 v2 v1 v0       0110 1001 1100
    3. Extract each high bit           v11  0  0  0   v7  0  0  0  v3  0  0  0       0000 1000 1000
    4. Shift right 2 and subtract       0 v11 v11 0    0 v7 v7  0   0 v3 v3  0       0000 0110 0110
    5. Add original number             w11 w10 w9 w8  w7 w6 w5 w4  w3 w2 w1 w0       0011 1100 1111
    6. Add original number         x12 x11 x10 x9 x8  x7 x6 x5 x4  x3 x2 x1 x0     0 0111 0011 1000 = 738


This method changes each individual digit d into 2d when $0 \le d \le 4$, and
into $6 + 2d = (2d - 10) + 2^4$ when $5 \le d \le 9$; and that is just what is needed to
double decimal numbers encoded with 4 bits per digit.
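
The six operations of Table 1 translate almost line for line into Python. This is a sketch under the assumption that the BCD number is held in an ordinary Python integer with 4 bits per digit; the function name and the explicit digit count `ndigits` (at least 1) are choices made for the example.

```python
def bcd_double(u: int, ndigits: int) -> int:
    """Double a binary-coded decimal number of ndigits 4-bit digits."""
    threes = int("3" * ndigits, 16)   # 0x33...3
    highs = int("8" * ndigits, 16)    # 0x88...8
    v = u + threes                    # step 2: add 3 to each digit
    h = v & highs                     # step 3: extract each high bit
    c = h - (h >> 2)                  # step 4: gives 0110 for every digit >= 5
    w = u + c                         # step 5: add original number
    return w + u                      # step 6: add original number again
```

With the example of the table, `bcd_double(0x369, 3)` yields `0x738`.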

Another related idea is to keep a table of the powers of two in decimal form, 
and to add the appropriate powers together by simulating decimal addition. A 
survey of bit-manipulation techniques appears in Section 7.1.3. 

Finally, even Method 2b can be used for the conversion of binary integers to 
decimal integers. We can find q as in (2), and then we can simulate the decimal 
division of q + 1 by w, using a “halving” process (exercise 10) that is similar 
to the doubling process just described, retaining only the first n digits to the 
right of the radix point in the answer. In this situation, Method 2b does not 
seem to offer advantages over the other three methods already discussed, but we 
have confirmed the remark made earlier that at least four distinct methods are 
available for converting integers from one radix to another. 


Now let us consider decimal-to-binary conversion (so that b = 10, B = 2). 
Method 1a simulates a decimal division by 2; this is feasible (see exercise 10),
but it is primarily suitable for incorporation in hardware instead of programs. 

Method 1b is the most practical method for decimal-to-binary conversion 
in the great majority of cases. The following MIX code assumes that there
are at least two digits in the number $(u_m \ldots u_1u_0)_{10}$ being converted, and that
$10^{m+1} < w$ so that overflow is not an issue:


            ENT1 M-1         Set j ← m − 1.
            LDA  INPUT+M     Set U ← u_m.
    1H      MUL  =10=
            SLAX 5                                                        (6)
            ADD  INPUT,1     U ← 10U + u_j.
            DEC1 1
            J1NN 1B          Repeat for m > j ≥ 0. ∎


The multiplication by 10 could be replaced by shifting and adding. 
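
As an illustration of replacing the multiplication by shifting and adding, the following Python sketch forms 10U as 8U + 2U. The function name and the list-of-digits input are assumptions made for this example.

```python
def decimal_digits_to_binary(digits: list[int]) -> int:
    """Convert decimal digits (most significant first) to a binary integer."""
    U = 0
    for d in digits:
        U = (U << 3) + (U << 1) + d   # U <- 10U + u_j, with 10U = 8U + 2U
    return U
```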

A trickier but perhaps faster method, which uses about lg m multiplications, 
extractions, and additions instead of m — 1 multiplications and additions, is 
described in exercise 19. 

For the conversion of decimal fractions $(0.u_{-1}u_{-2}\ldots u_{-m})_{10}$ to binary form,
we can use Method 2b; or, more commonly, we can first convert the integer
$(u_{-1}u_{-2}\ldots u_{-m})_{10}$ by Method 1b and then divide by $10^m$.


C. Hand calculation. It is occasionally necessary for computer programmers to 
convert numbers by hand, and since this is a subject not yet taught in elementary 
schools, it may be worthwhile to examine it briefly here. There are simple pencil- 
and-paper methods for converting between decimal and octal notations, and 
these methods are easily learned, so they should be more widely known. 


Converting octal integers to decimal. The simplest conversion is from octal 
to decimal; this technique was apparently first published by Walter Soden, Math. 
Comp. 7 (1953), 273-274. To do the conversion, write down the given octal
number; then at the kth step, double the k leading digits using decimal arithmetic,
and subtract this from the k + 1 leading digits using decimal arithmetic. The 
process terminates in m steps if the given number has m + 1 digits. It is a good 
idea to insert a radix point to show which digits are being doubled, as shown in 
the following example, in order to prevent embarrassing mistakes. 


Example 1. Convert $(5325121)_8$ to decimal.

      5.3 2 5 1 2 1
    − 1 0
      4 3.2 5 1 2 1
    −   8 6
      3 4 6.5 1 2 1
    −   6 9 2
      2 7 7 3.1 2 1
    −   5 5 4 6
      2 2 1 8 5.2 1
    −   4 4 3 7 0
      1 7 7 4 8 2.1
    −   3 5 4 9 6 4
      1 4 1 9 8 5 7        Answer: $(1419857)_{10}$.
A reasonably good check on the computations may be had by “casting out
nines”: The sum of the digits of the decimal number must be congruent modulo 9
to the alternating sum and difference of the digits of the octal number, with the
rightmost digit of the latter given a plus sign. In the example above, we have
1 + 4 + 1 + 9 + 8 + 5 + 7 = 35, and 1 − 2 + 1 − 5 + 2 − 3 + 5 = −1; the
difference is 36 (a multiple of 9). If this test fails, it can be applied to the k + 1
leading digits after the kth step, and the error can be located using a “binary 
search” procedure; in other words, we can locate the error by first checking the 
middle result, then using the same procedure on the first or second half of the 
calculation, depending on whether the middle result is incorrect or correct. 
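
The check works because 10 ≡ 1 and 8 ≡ −1 (modulo 9). A brief Python sketch of the test, with a hypothetical function name and with both numbers passed as digit strings:

```python
def nines_check(octal: str, decimal: str) -> bool:
    """Return True if the decimal string passes the casting-out-nines test."""
    digit_sum = sum(int(c) for c in decimal)
    # Alternating sum of octal digits, rightmost digit taken with a plus sign.
    alternating = sum(int(c) * (-1) ** i for i, c in enumerate(reversed(octal)))
    return (digit_sum - alternating) % 9 == 0
```

A passing result does not prove that the conversion is right; it only fails to detect an error.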

The “casting-out-nines” process is only about 89 percent reliable, because 
there is one chance in nine that two random integers will differ by a multiple of 
nine. An even better check is to convert the answer back to octal by using an 
inverse method, which we shall now consider. 


Converting decimal integers to octal. A similar procedure can be used for 
the opposite conversion: Write down the given decimal number; then at the kth 
step, double the k leading digits using octal arithmetic, and add these to the
k +1 leading digits using octal arithmetic. The process terminates in m steps if 
the given number has m + 1 digits. 


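Example 2. Convert $(1419857)_{10}$ to octal.

      1.4 1 9 8 5 7
    +   2
      1 6.1 9 8 5 7
    +   3 4
      2 1 5.9 8 5 7
    +   4 3 2
      2 6 1 3.8 5 7
    +   5 4 2 6
      3 3 5 6 6.5 7
    +   6 7 3 5 4
      4 2 5 2 4 1.7
    + 1 0 5 2 5 0 2
      5 3 2 5 1 2 1        Answer: $(5325121)_8$.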


(Notice that the nonoctal digits 8 and 9 enter into this octal computation.) 
The answer can be checked as discussed above. This method was published by 
Charles P. Rozier, IEEE Trans. EC-11 (1962), 708-709. 


The two procedures just given are essentially Method 1b of the general 
radix-conversion procedures. Doubling and subtracting in decimal notation is 
like multiplying by 10 — 2 = 8; doubling and adding in octal notation is like 
multiplying by 8 +2 = 10. There is a similar method for hexadecimal/decimal 
conversions, but it is a little more difficult since it involves multiplication by 6 
instead of by 2. 
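
Both hand procedures can be traced with a few lines of Python. This sketch is only a simulation under stated assumptions: the value of the leading digits is held as a Python integer, so the "decimal arithmetic" and "octal arithmetic" of the pencil-and-paper methods are implicit, and the function names are invented for the example. Appending the next digit d to the leading value in the target notation and then subtracting or adding the double gives 8·lead + d or 10·lead + d respectively.

```python
def octal_to_decimal_steps(octal: str) -> list[int]:
    """Values of the k+1 leading digits after each step (double and subtract)."""
    lead = int(octal[0])
    steps = [lead]
    for c in octal[1:]:
        lead = (10 * lead + int(c)) - 2 * lead   # net effect: 8*lead + digit
        steps.append(lead)
    return steps

def decimal_to_octal_steps(decimal: str) -> list[str]:
    """Octal notation of the k+1 leading digits after each step (double and add)."""
    lead = int(decimal[0])
    steps = [format(lead, "o")]
    for c in decimal[1:]:
        lead = (8 * lead + int(c)) + 2 * lead    # net effect: 10*lead + digit
        steps.append(format(lead, "o"))
    return steps
```

The intermediate values returned for 5325121 (octal) and 1419857 (decimal) can be compared with the rows of a hand calculation to locate a slip.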
To keep these two methods straight in our minds, it is not hard to remember
that we must subtract to go from octal to decimal, since the decimal representa- 
tion of a number is smaller; similarly we must add to go from decimal to octal. 
The computations are performed using the radix of the answer, not the radix of 
the given number, otherwise we couldn’t get the desired answer. 


Converting fractions. No equally fast method of converting fractions manually 
is known. The best way seems to be Method 2a, with doubling and adding 
or subtracting to simplify the multiplications by 10 or by 8. In this case, we 
reverse the addition-subtraction criterion, adding when we convert to decimal 
and subtracting when we convert to octal; we also use the radix of the given 
input number, not the radix of the answer, in this computation (see Examples 
3 and 4). The process is about twice as hard as the method that we used for 
integers. 


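Example 3. Convert $(.14159)_{10}$ to octal.

    .1 4 1 5 9
       2 8 3 1 8 −
     1.1 3 2 7 2
         2 6 5 4 4 −
       1.0 6 1 7 6
           1 2 3 5 2 −
         0.4 9 4 0 8
             9 8 8 1 6 −
           3.9 5 2 6 4
             1 9 0 5 2 8 −
             7.6 2 1 1 2
               1 2 4 2 2 4 −
               4.9 6 8 9 6        Answer: $(.110374\ldots)_8$.


Example 4. Convert $(.110374)_8$ to decimal.

    .1 1 0 3 7 4
       2 2 0 7 7 0 +
     1.3 2 4 7 3 0
         6 5 1 6 6 0 +
       4.1 2 1 1 6 0
           2 4 2 3 4 0 +
         1.4 5 4 1 4 0
           1 1 3 0 3 0 0 +
           5.6 7 1 7 0 0
             1 5 6 3 6 0 0 +
             8.5 0 2 6 0 0
               1 2 0 5 4 0 0 +
               6.2 3 4 0 0 0        Answer: $(.141586\ldots)_{10}$.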
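
The fraction procedure (Method 2a with doubling) can be sketched in Python as follows. The sketch assumes that the fraction is carried as an integer numerator over a fixed power of the input radix, which mirrors working in the radix of the given number; the function names are hypothetical.

```python
def decimal_fraction_to_octal(frac_digits: str, places: int) -> list[int]:
    """First `places` octal digits of the decimal fraction .frac_digits."""
    scale = 10 ** len(frac_digits)
    x = int(frac_digits)                 # the fraction is x / scale
    out = []
    for _ in range(places):
        x = 10 * x - 2 * x               # multiply by 8: shift, then subtract the double
        d, x = divmod(x, scale)          # digit to the left of the radix point
        out.append(d)
    return out

def octal_fraction_to_decimal(frac_digits: str, places: int) -> list[int]:
    """First `places` decimal digits of the octal fraction .frac_digits."""
    scale = 8 ** len(frac_digits)
    x = int(frac_digits, 8)              # the fraction is x / scale
    out = []
    for _ in range(places):
        x = 8 * x + 2 * x                # multiply by 10: shift, then add the double
        d, x = divmod(x, scale)
        out.append(d)
    return out
```

The digits are truncated, not rounded, as in the hand calculation.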
D. Floating point conversion. When floating point values are to be
converted, it is necessary to deal with both the exponent and the fraction parts
simultaneously, since conversion of the exponent will affect the fraction part.
Given the number $f \cdot 2^e$ to be converted to decimal, we may express $2^e$ in the
form $F \cdot 10^E$ (usually by means of auxiliary tables), and then convert $Ff$ to
decimal. Alternatively, we can multiply e by $\log_{10} 2$ and round this to the nearest
integer E; then divide $f \cdot 2^e$ by $10^E$ and convert the result. Conversely, given the
number $F \cdot 10^E$ to be converted to binary, we may convert F and then multiply
it by the floating point number $10^E$ (again by using auxiliary tables). Obvious
techniques can be used to reduce the maximum size of the auxiliary tables by 
using several multiplications and/or divisions, although this can cause rounding 
errors to propagate. Exercise 17 considers the minimization of error. 
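
The second approach (multiply e by $\log_{10} 2$, round to an integer E, then divide by $10^E$) can be sketched in Python. This is an illustration only: exact rationals avoid the rounding errors that a real implementation must control, no normalization of the resulting fraction is attempted, and the function name is an assumption of this example.

```python
import math
from fractions import Fraction

def binary_to_decimal_parts(f: Fraction, e: int) -> tuple[Fraction, int]:
    """Return (F, E) with F * 10**E == f * 2**e and E = round(e * log10(2))."""
    E = round(e * math.log10(2))                   # nearest-integer decimal exponent
    F = f * Fraction(2) ** e / Fraction(10) ** E   # exact decimal fraction part
    return F, E
```

The fraction F would then be converted to decimal digits by one of the single-precision methods.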


E. Multiple-precision conversion. When converting extremely long numbers, 
it is most convenient to start by converting blocks of digits, which can be handled 
by single-precision techniques, and then to combine these blocks by using simple 
multiple-precision techniques. For example, suppose that $10^n$ is the highest
power of 10 less than the computer word size. Then:


a) To convert a multiple-precision integer from binary to decimal, divide it
repeatedly by $10^n$ (thus converting from binary to radix $10^n$ by Method 1a).
Single-precision operations will give the n decimal digits for each place of the
radix-$10^n$ representation.

b) To convert a multiple-precision fraction from binary to decimal, proceed
similarly, multiplying by $10^n$ (that is, using Method 2a with $B = 10^n$).

c) To convert a multiple-precision integer from decimal to binary, convert
blocks of n digits first; then use Method 1b to convert from radix $10^n$ to binary.

d) To convert a multiple-precision fraction from decimal to binary, convert first
to radix $10^n$ as in (c), then use Method 2b.
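
Cases (a) and (c) can be sketched in Python as follows. The sketch makes several assumptions for the sake of illustration: Python's unbounded integers stand in for the multiple-precision routines, the built-in `int` and string formatting stand in for the single-precision block conversions, the block size defaults to n = 9, and the function names are invented.

```python
def big_to_decimal(u: int, n: int = 9) -> str:
    """Case (a): repeated division by 10**n, then n digits per block."""
    base = 10 ** n
    blocks = []
    while True:
        u, r = divmod(u, base)          # r is one place of the radix-10**n form
        blocks.append(r)
        if u == 0:
            break
    head = str(blocks.pop())            # leading block: no zero padding
    return head + "".join(f"{blk:0{n}d}" for blk in reversed(blocks))

def decimal_to_big(s: str, n: int = 9) -> int:
    """Case (c): convert blocks of n digits, then combine them by Method 1b."""
    first = len(s) % n or n             # length of the leading (possibly short) block
    u = int(s[:first])
    for i in range(first, len(s), n):
        u = u * 10 ** n + int(s[i:i + n])
    return u
```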


F. History and Bibliography. Radix-conversion techniques implicitly origi- 
nated in ancient problems dealing with weights, measures, and currencies, where 
mixed-radix systems were generally involved. Auxiliary tables were usually 
prepared to help people make the conversions. During the seventeenth century, 
when sexagesimal fractions were being supplanted by decimal fractions, it was 
necessary to convert between the two systems in order to use existing books of 
astronomical tables; a systematic method to transform fractions from radix 60 
to radix 10 and vice versa was given in the 1667 edition of William Oughtred’s 
Clavis Mathematicæ, Chapter 6, Section 18. (This material was not present
in the original 1631 edition of Oughtred’s book.) Conversion rules had already 
been given by al-Kashi of Samarkand in his Key to Arithmetic (1427), where 
Methods 1a, 1b, and 2a are clearly explained [Istoriko-Mat. Issled. 7 (1954),
126-135], but his work was unknown in Europe. The 18th century American 
mathematician Hugh Jones used the words “octavation” and “decimation” to 
describe octal/decimal conversions, but his methods were not as clever as his 
terminology. A. M. Legendre [Théorie des Nombres (Paris: 1798), 229] noted 
that positive integers may be conveniently converted to binary form if they are
repeatedly divided by 64. 

In 1946, H. H. Goldstine and J. von Neumann gave prominent consideration 
to radix conversion in their classic memoir, Planning and Coding Problems for 
an Electronic Computing Instrument, because it was necessary to justify the use 
of binary arithmetic; see John von Neumann, Collected Works 5 (New York: 
Macmillan, 1963), 127-142. Another early discussion of radix conversion on 
binary computers was published by F. Koons and S. Lubkin, Math. Comp. 3 
(1949), 427-431, who suggested a rather unusual method. The first discussion 
of floating point conversion was given somewhat later by F. L. Bauer and K.
Samelson [Zeit. für angewandte Math. und Physik 4 (1953), 312-316].

The following articles are, similarly, of historic interest: A note by G. T. 
Lake [CACM 5 (1962), 468-469] mentioned some hardware techniques for con- 
version and gave clear examples. A. H. Stroud and D. Secrest [Comp. J. 6 
(1963), 62-66] discussed conversion of multiple-precision floating point numbers. 
The conversion of unnormalized floating point numbers, preserving the amount of 
“significance” implied by the representation, was discussed by H. Kanner [JACM 
12 (1965), 242-246] and by N. Metropolis and R. L. Ashenhurst [Math. Comp. 
19 (1965), 435-441]. See also K. Sikdar, Sankhya B30 (1968), 315-334, and the 
references cited in his paper. 

Detailed subroutines for formatted input and output of integers and floating 
point numbers in the C programming language have been given by P. J. Plauger 
in The Standard C Library (Prentice-Hall, 1992), 301-331. 


EXERCISES 


1. [25] Generalize Method 1b so that it works with arbitrary mixed-radix notations,
converting


    $a_m b_{m-1}\ldots b_1b_0 + \cdots + a_1b_0 + a_0$ into $A_M B_{M-1}\ldots B_1B_0 + \cdots + A_1B_0 + A_0$,


where $0 \le a_j < b_j$ and $0 \le A_J < B_J$ for $0 \le j < m$ and $0 \le J < M$.

Give an example of your generalization by manually converting “3 days, 9 hours,
12 minutes, and 37 seconds” into long tons, hundredweights, stones, pounds, and
ounces. (Let one second equal one ounce. The British system of weights has 1 stone =
14 pounds, 1 hundredweight = 8 stone, 1 long ton = 20 hundredweight.) In other
words, let $b_0 = 60$, $b_1 = 60$, $b_2 = 24$, $m = 3$, $B_0 = 16$, $B_1 = 14$, $B_2 = 8$,
$B_3 = 20$, $M = 4$; the problem is to find $A_4, \ldots, A_0$ in the proper ranges such that
$3b_2b_1b_0 + 9b_1b_0 + 12b_0 + 37 = A_4B_3B_2B_1B_0 + A_3B_2B_1B_0 + A_2B_1B_0 + A_1B_0 + A_0$, using
a systematic method that generalizes Method 1b. (All arithmetic is to be done in a
mixed-radix system.)


2. [25] Generalize Method 1a so that it works with mixed-radix notations, as in
exercise 1, and give an example of your generalization by manually solving the same 
conversion problem stated in exercise 1. 


3. [25] (D. Taranto.) When fractions are being converted, there is no obvious way to
decide how many digits to give in the answer. Design a simple generalization of Method
2a that, given two positive radix-b fractions u and $\epsilon$ between 0 and 1, converts u to a
rounded radix-B equivalent U that has just enough places M to the right of the radix
point to ensure that $|U - u| < \epsilon$. (In particular if u is a multiple of $b^{-m}$ and $\epsilon = b^{-m}/2$,
the value of U will have just enough digits so that u can be recomputed exactly, given
U and m. Note that M might be zero; for example, if $\epsilon \le 1/2$ and $u > 1 - \epsilon$, the proper
answer is U = 1.)


4. [M21] (a) Prove that every real number with a terminating binary representation 
also has a terminating decimal representation. (b) Find a simple condition on the 
positive integers b and B that is satisfied if and only if every real number that has a 
terminating radix-b representation also has a terminating radix-B representation. 


5. [M20] Show that program (4) would still work if the instruction ‘LDX =10ⁿ=’ were
replaced by ‘LDX =c=’ for certain other constants c.


6. [30] Discuss using Methods 1a, 1b, 2a, and 2b when b or B is −2.


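7. [M18] Given that $0 < \alpha \le x \le \alpha + 1/w$ and $0 \le u < w$, where u is an integer,
prove that $\lfloor ux\rfloor$ is equal to either $\lfloor \alpha u\rfloor$ or $\lfloor \alpha u\rfloor + 1$. Furthermore $\lfloor ux\rfloor = \lfloor \alpha u\rfloor$ exactly,
if $u < \alpha w$ and $\alpha^{-1}$ is an integer.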


8. [24] Write a MIX program analogous to (1) that uses (5) and includes no division 
instructions. 


9. [M29] The purpose of this exercise is to compute $\lfloor u/10\rfloor$ with binary shifting and
addition operations only, when u is a nonnegative integer. Let $v_0(u) = 3\lfloor u/2\rfloor + 3$ and


    $v_{k+1}(u) = v_k(u) + \lfloor v_k(u)/2^{2^{k+2}}\rfloor$ for $k \ge 0$.


Given k, what is the smallest nonnegative integer u such that $\lfloor v_k(u)/16\rfloor \ne \lfloor u/10\rfloor$?


10. [22] Table 1 shows how a binary-coded decimal number can be doubled by using 
various shifting, extracting, and addition operations on a binary computer. Give an 
analogous method that computes half of a binary-coded decimal number (throwing 
away the remainder if the number is odd). 


11. [16] Convert $(57721)_8$ to decimal.


12. [22] Invent a rapid pencil-and-paper method for converting integers from ternary 
notation to decimal, and illustrate your method by converting (1212011210210)3 into 
decimal. How would you go from decimal to ternary? 


13. [25] Assume that locations U + 1, U + 2, ..., U+ m contain a multiple-precision 
fraction $(.u_{-1}u_{-2}\ldots u_{-m})_b$, where b is the word size of MIX. Write a MIX routine that
converts this fraction to decimal notation, truncating it to 180 decimal digits. The 
answer should be printed on two lines, with the digits grouped into 20 blocks of nine 
each separated by blanks. (Use the CHAR instruction.) 


14. [M27] (A. Schönhage.) The text’s method of converting multiple-precision
integers requires an execution time of order $n^2$ to convert an n-place integer, when
n is large. Show that it is possible to convert n-digit decimal integers into binary
notation in $O(M(n)\log n)$ steps, where M(n) is an upper bound on the number of
steps needed to multiply n-bit binary numbers that satisfies the “smoothness condition”
$M(2n) \ge 2M(n)$.


15. [M47] Can the upper bound on the time to convert large integers given in the 
preceding exercise be substantially lowered? (See exercise 4.3.3-12.) 


16. [41] Construct a fast linear iterative array for radix conversion from decimal to 
binary (see Section 4.3.3E). 
17. [M40] Design “ideal” floating point conversion subroutines, taking p-digit decimal
numbers into P-digit binary numbers and vice versa, in both cases producing a true 
rounded result in the sense of Section 4.2.2. 


18. [HM34] (David W. Matula.) Let $\mathrm{round}_b(u, p)$ be the function of b, u, and p that
represents the best p-digit base b floating point approximation to u, in the sense of
Section 4.2.2. Under the assumption that $\log_B b$ is irrational and that the range of
exponents is unlimited, prove that


    $u = \mathrm{round}_b(\mathrm{round}_B(u, P), p)$


holds for all p-digit base b floating point numbers u if and only if $B^{P-1} > b^p$. (In
other words, an “ideal” input conversion of u into an independent base B, followed by 
an “ideal” output conversion of this result, will always yield u again if and only if the 
intermediate precision P is suitably large, as specified by the formula above.) 


19. [M23] Let the decimal number $u = (u_7\ldots u_1u_0)_{10}$ be represented as the
binary-coded decimal number $U = (u_7\ldots u_1u_0)_{16}$. Find appropriate constants $c_i$ and masks $m_i$
so that the operation $U \leftarrow U - c_i(U \mathbin{\&} m_i)$, repeated for i = 1, 2, 3, will convert U to
the binary representation of u, where “&” denotes extraction (bitwise AND).
4.5. RATIONAL ARITHMETIC


IT IS OFTEN IMPORTANT to know that the answer to some numerical problem 
is exactly 1/3, not a floating point number that gets printed as “0.333333574”. 
If arithmetic is done on fractions instead of on approximations to fractions, 
many computations can be done entirely without any accumulated rounding 
errors. This results in a comfortable feeling of security that is often lacking when 
floating point calculations have been made, and it means that the accuracy of 
the calculation cannot be improved upon. 


Irrationality is the square root of all evil. 
— DOUGLAS HOFSTADTER, Metamagical Themas (1983) 


4.5.1. Fractions 


When fractional arithmetic is desired, the numbers can be represented as pairs
of integers, $(u/u')$, where $u$ and $u'$ are relatively prime to each other and $u' > 0$.
The number zero is represented as (0/1). In this form, $(u/u') = (v/v')$ if and
only if $u = v$ and $u' = v'$.

Multiplication of fractions is, of course, easy; to form $(u/u') \times (v/v') =
(w/w')$, we can simply compute $uv$ and $u'v'$. The two products $uv$ and $u'v'$
might not be relatively prime, but if $d = \gcd(uv, u'v')$, the desired answer is
$w = uv/d$, $w' = u'v'/d$. (See exercise 2.) Efficient algorithms to compute the
greatest common divisor are discussed in Section 4.5.2.

Another way to perform the multiplication is to find $d_1 = \gcd(u, v')$ and
$d_2 = \gcd(u', v)$; then the answer is $w = (u/d_1)(v/d_2)$, $w' = (u'/d_2)(v'/d_1)$. (See
exercise 3.) This method requires two gcd calculations, but it is not really slower
than the former method; the gcd process involves a number of iterations that
is essentially proportional to the logarithm of its inputs, so the total number of
iterations needed to evaluate both $d_1$ and $d_2$ is essentially the same as the number
of iterations during the single calculation of d. Furthermore, each iteration in
the evaluation of $d_1$ and $d_2$ is potentially faster, because comparatively small
numbers are being examined. If $u$, $u'$, $v$, and $v'$ are single-precision quantities,
this method has the advantage that no double-precision numbers appear in the
calculation unless it is impossible to represent both of the answers $w$ and $w'$ in
single-precision form.
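
A compact Python sketch of the two-gcd multiplication, assuming that fractions are passed as separate numerator and denominator integers already in lowest terms with positive denominators. The function name and argument order are choices made for this example; `math.gcd` is the standard library routine.

```python
from math import gcd

def mul_fractions(u: int, u1: int, v: int, v1: int) -> tuple[int, int]:
    """Return (w, w1) with w/w1 = (u/u1) * (v/v1) in lowest terms."""
    d1 = gcd(u, v1)            # common factor of one numerator and the other denominator
    d2 = gcd(u1, v)
    w = (u // d1) * (v // d2)
    w1 = (u1 // d2) * (v1 // d1)
    return w, w1
```

All intermediate quantities are no larger than the final numerator and denominator, which is the advantage noted above.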

Division may be done in a similar manner; see exercise 4. 

Addition and subtraction are slightly more complicated. The obvious
procedure is to set $(u/u') + (v/v') = ((uv' + u'v)/u'v')$ and then to reduce this
fraction to lowest terms by calculating $d = \gcd(uv' + u'v, u'v')$, as in the first
multiplication method. But again it is possible to avoid working with such large
numbers, if we start by calculating $d_1 = \gcd(u', v')$. If $d_1 = 1$, then the desired
numerator and denominator are $w = uv' + u'v$ and $w' = u'v'$. (According to
Theorem 4.5.2D, $d_1$ will be 1 about 61 percent of the time, if the denominators
$u'$ and $v'$ are randomly distributed, so it is wise to single out this case.) If
$d_1 > 1$, then let $t = u(v'/d_1) + v(u'/d_1)$ and calculate $d_2 = \gcd(t, d_1)$; finally the
answer is $w = t/d_2$, $w' = (u'/d_1)(v'/d_2)$. (Exercise 6 proves that these values
of $w$ and $w'$ are relatively prime to each other.) If single-precision numbers are
being used, this method requires only single-precision operations, except that
t may be a double-precision number or slightly larger (see exercise 7); since
$\gcd(t, d_1) = \gcd(t \bmod d_1, d_1)$, the calculation of $d_2$ does not require double
precision.

For example, to compute (7/66) + (17/12), we form $d_1 = \gcd(66, 12) = 6$;
then $t = 7\cdot 2 + 17\cdot 11 = 201$, and $d_2 = \gcd(201, 6) = 3$, so the answer is

    $\frac{201}{3} \Big/ \Big(\frac{66}{6}\cdot\frac{12}{3}\Big) = 67/44.$
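
The addition procedure can be sketched in Python along the same lines. As before, this assumes inputs in lowest terms with positive denominators; the function name is hypothetical, and Python's unbounded integers hide the single-precision versus double-precision distinction discussed in the text.

```python
from math import gcd

def add_fractions(u: int, u1: int, v: int, v1: int) -> tuple[int, int]:
    """Return (w, w1) with w/w1 = u/u1 + v/v1 in lowest terms."""
    d1 = gcd(u1, v1)
    if d1 == 1:                          # the common case: nothing to cancel
        return u * v1 + u1 * v, u1 * v1
    t = u * (v1 // d1) + v * (u1 // d1)
    d2 = gcd(t % d1, d1)                 # gcd(t, d1) = gcd(t mod d1, d1)
    return t // d2, (u1 // d1) * (v1 // d2)
```

Calling `add_fractions(7, 66, 17, 12)` reproduces the worked example and returns `(67, 44)`.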


To help check out subroutines for rational arithmetic, inversion of matrices 
with known inverses (like Cauchy matrices, exercise 1.2.3-41) is suggested. 

Experience with fractional calculations shows that in many cases the
numbers grow to be quite large. So if $u$ and $u'$ are intended to be single-precision
numbers for each fraction (u/u’), it is important to include tests for overflow 
in each of the addition, subtraction, multiplication, and division subroutines. 
For numerical problems in which perfect accuracy is important, a set of subrou- 
tines for fractional arithmetic with arbitrary precision allowed in numerator and 
denominator is very useful. 

The methods of this section extend also to other number fields besides the
rational numbers; for example, we could do arithmetic on quantities of the form
(u + u’√5)/u″, where u, u’, u″ are integers, gcd(u, u’, u″) = 1, and u″ > 0; or
on quantities of the form (u + u’ ∛2 + u″ ∛4)/u‴, etc.

Instead of insisting on exact calculations with fractions, it is interesting to 
consider also “fixed slash” and “floating slash” numbers, which are analogous to 
floating point numbers but based on rational fractions instead of radix-oriented 
fractions. In a binary fixed-slash scheme, the numerator and denominator of 
a representable fraction each consist of at most p bits, for some given p. In a
floating-slash scheme, the sum of numerator bits plus denominator bits must be
a total of at most q, for some given q, and another field of the representation is
used to indicate how many of these q bits belong to the numerator. Infinity can
be represented as (1/0). To do arithmetic on such numbers, we define x ⊕ y =
round(x + y), x ⊖ y = round(x − y), etc., where round(x) = x if x is representable,
otherwise it is one of the two representable numbers that surround x.

It may seem at first that the best definition of round(x) would be to choose
the representable number that is closest to x, by analogy with the way we round
in floating point arithmetic. But experience has shown that it is best to bias our
rounding towards “simple” numbers, since numbers with small numerator and
denominator occur much more often than complicated fractions do. We want
more numbers to round to 1/2 than to 127/255. The rounding rule that turns out to
be most successful in practice is called “mediant rounding”: If (u/u’) and (v/v’)
are adjacent representable numbers, so that whenever u/u’ < x < v/v’ we must
have round(x) equal to (u/u’) or (v/v’), the mediant rounding rule says that

    round(x) = u/u’ for x < (u + v)/(u’ + v’),    round(x) = v/v’ for x > (u + v)/(u’ + v’).    (1)

If x = (u+v)/(u’+v’) exactly, we let round(x) be the neighboring fraction with
the smallest denominator (or, if u’ = v’, with the smallest numerator). Exercise 
4.5.3-43 shows that it is not difficult to implement mediant rounding efficiently. 

For example, suppose we are doing fixed slash arithmetic with p = 8, so
that the representable numbers (u/u’) have −128 < u < 128 and 0 ≤ u’ < 256
and u ⊥ u’. This isn’t much precision, but it is enough to give us a feel for
slash arithmetic. The numbers adjacent to 0 = (0/1) are (−1/255) and (1/255);
according to the mediant rounding rule, we will therefore have round(x) = 0
if and only if |x| ≤ 1/256. Suppose we have a calculation that would take the
overall form 22/7 = 314/159 + 1300/1113 if we were working in exact rational
arithmetic, but the intermediate quantities have had to be rounded to
representable numbers. In this case 314/159 would round to (79/40) and
1300/1113 would round to (7/6). The rounded terms sum to
79/40 + 7/6 = 377/120, which rounds to (22/7); so we have obtained
the correct answer even though three roundings were required. This example was
not specially contrived. When the answer to a problem is a simple fraction, slash
arithmetic tends to make the intermediate rounding errors cancel out.
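
Mediant rounding for the fixed-slash example can be sketched in Python as shown below. This is not the efficient method of exercise 4.5.3-43; it is a simple illustration that walks down the Stern–Brocot tree until the next mediant is no longer representable. The function name `fixed_slash_round`, the default bounds 127 and 255 (chosen to match the p = 8 example), the tuple representation, and the treatment of negative inputs by symmetry are all assumptions made here.

```python
from fractions import Fraction

def fixed_slash_round(x, max_num=127, max_den=255):
    """Mediant-round the Fraction x to a fixed-slash number (num, den).

    (1, 0) stands for infinity.  Negative x is handled by symmetry,
    which is an assumption of this sketch.
    """
    if x < 0:
        n, d = fixed_slash_round(-x, max_num, max_den)
        return (-n, d)
    a, b = 0, 1          # lower neighbour a/b
    c, d = 1, 0          # upper neighbour c/d
    while True:
        m_num, m_den = a + c, b + d                           # mediant of the bracket
        sign = x.numerator * m_den - m_num * x.denominator    # sign of x - mediant
        if m_num > max_num or m_den > max_den:
            break        # nothing representable lies strictly between a/b and c/d
        if sign == 0:
            return (m_num, m_den)                             # x itself is representable
        if sign < 0:
            c, d = m_num, m_den
        else:
            a, b = m_num, m_den
    if sign < 0:
        return (a, b)
    if sign > 0:
        return (c, d)
    # Tie: prefer the smaller denominator, then the smaller numerator.
    return min((a, b), (c, d), key=lambda f: (f[1], f[0]))

r1 = fixed_slash_round(Fraction(314, 159))
r2 = fixed_slash_round(Fraction(1300, 1113))
print(r1, r2, fixed_slash_round(Fraction(*r1) + Fraction(*r2)))
```

The loop takes one step per tree level, so it can need a few hundred iterations for p = 8; a continued-fraction formulation would avoid that, at the cost of a longer program.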

Exact representation of fractions within a computer was first discussed in 
the literature by P. Henrici, JACM 3 (1956), 6-9. Fixed and floating slash 
arithmetic were proposed by David W. Matula, in Applications of Number 
Theory to Numerical Analysis, edited by S. K. Zaremba (New York: Academic 
Press, 1972), 486-489. Further developments of the idea are discussed by Matula 
and Kornerup in Proc. IEEE Symp. Computer Arith. 4 (1978), 29-38, 39-47; 
Lecture Notes in Comp. Sci. 72 (1979), 383-397; Computing, Suppl. 2 (1980), 
85-111; IEEE Trans. C-32 (1983), 378-388; IEEE Trans. C-34 (1985), 3-18; 
IEEE Trans. C-39 (1990), 1106-1115. 


EXERCISES 


1. [15] Suggest a reasonable computational method for comparing two fractions, to 
test whether or not (u/u’) < (v/v’). 


2. [M15] Prove that if d = gcd(u,v) then u/d and v/d are relatively prime. 

3. [M20] Prove that u ⊥ u’ and v ⊥ v’ implies gcd(uv, u’v’) = gcd(u, v’) gcd(u’, v).

4. [11] Design a division algorithm for fractions, analogous to the second
multiplication method of the text. (Note that the sign of v must be considered.)

5. [10] Compute (17/120) + (—27/70) by the method recommended in the text. 

> 6. [M23] Show that u ⊥ u’ and v ⊥ v’ implies gcd(uv’ + vu’, u’v’) = d₁d₂, where
d₁ = gcd(u’, v’) and d₂ = gcd(d₁, u(v’/d₁) + v(u’/d₁)). (Hence if d₁ = 1 we have
(uv’ + vu’) ⊥ u’v’.)

7. [M22] How large can the absolute value of the quantity t become, in the addition- 
subtraction method recommended in the text, if the numerators and denominators of 
the inputs are less than N in absolute value? 


> 8. [22] Discuss using (1/0) and (−1/0) as representations for ∞ and −∞, and/or as
representations of overflow. 


9. [M23] If 1 ≤ u’, v’ < 2^n, show that ⌊2^(2n) u/u’⌋ = ⌊2^(2n) v/v’⌋ implies u/u’ = v/v’.

10. [41] Extend the subroutines suggested in exercise 4.3.1-34 so that they deal with
“arbitrary” rational numbers.

11. [M23] Consider fractions of the form (u + u’√5)/u″, where u, u’, u″ are integers,
gcd(u, u’, u″) = 1, and u″ > 0. Explain how to divide two such fractions and to obtain
a quotient having the same form. 

12. [M16] What is the largest finite floating slash number, given a bound q on the 
numerator length plus the denominator length? Which numbers round to (0/1)? 


13. [20] (Matula and Kornerup.) Discuss the representation of floating slash numbers 
in a 32-bit word. 


14. [M23] Explain how to compute the exact number of pairs of integers (u, u’) such
that M₁ < u ≤ M₂ and N₁ < u’ ≤ N₂ and u ⊥ u’. (This can be used to determine how
many numbers are representable in slash arithmetic. According to Theorem 4.5.2D,
the number will be approximately (6/π²)(M₂ − M₁)(N₂ − N₁).)

15. [42] Modify one of the compilers at your installation so that it will replace all 
floating point calculations by floating slash calculations. Experiment with the use of 
slash arithmetic by running existing programs that were written by programmers who 
actually had floating point arithmetic in mind. (When special subroutines like square 
root or logarithm are called, your system should automatically convert slash numbers 
to floating point form before the subroutine is invoked, then back to slash form again 
afterwards. There should be a new option to print slash numbers in a fractional format; 
however, you should also print slash numbers in decimal notation as usual, if no changes 
are made to a user’s source program.) Are the results better or worse, when floating 
slash numbers are substituted? 


16. [40] Experiment with interval arithmetic on slash numbers. 


4.5.2. The Greatest Common Divisor 


If u and v are integers, not both zero, we say that their greatest common divisor, 
gcd(u, v), is the largest integer that evenly divides both u and v. This definition 
makes sense, because if u ≠ 0 then no integer greater than |u| can evenly divide u,
but the integer 1 does divide both u and v; hence there must be a largest integer 
that divides them both. When u and v are both zero, every integer evenly divides 
zero, so the definition above does not apply; it is convenient to set 


    gcd(0, 0) = 0.    (1)

The definitions just given obviously imply that

    gcd(u, v) = gcd(v, u),     (2)
    gcd(u, v) = gcd(−u, v),    (3)
    gcd(u, 0) = |u|.           (4)


In the previous section, we reduced the problem of expressing a rational 
number in lowest terms to the problem of finding the greatest common divisor 
of its numerator and denominator. Other applications of the greatest common 
divisor have been mentioned for example in Sections 3.2.1.2, 3.3.3, 4.3.2, 4.3.3. 
So the concept of gcd(u, v) is important and worthy of serious study. 

The least common multiple of two integers u and v, written lcm(u, v), is
a related idea that is also important. It is defined to be the smallest positive
integer that is an integer multiple of both u and v; and lcm(u, 0) = lcm(0, v) = 0.
The classical method for teaching children how to add fractions u/u’ + v/v’ is 
to train them to find the “least common denominator,” which is lem(u’, v’). 

According to the “fundamental theorem of arithmetic” (proved in exercise 
1.2.4-21), each positive integer u can be expressed in the form 


$$ u = 2^{u_2}\, 3^{u_3}\, 5^{u_5}\, 7^{u_7}\, 11^{u_{11}} \cdots = \prod_{p\ \mathrm{prime}} p^{u_p}, \tag{5} $$

where the exponents u2, u3, ... are uniquely determined nonnegative integers, 


and where all but a finite number of the exponents are zero. From this canonical 
factorization of a positive integer, we immediately obtain one way to compute 
the greatest common divisor of u and v: By (2), (3), and (4), we may assume 
that u and v are positive integers, and if both of them have been canonically 
factored into primes we have 


$$ \gcd(u, v) = \prod_{p\ \mathrm{prime}} p^{\min(u_p,\, v_p)}, \tag{6} $$

$$ \operatorname{lcm}(u, v) = \prod_{p\ \mathrm{prime}} p^{\max(u_p,\, v_p)}. \tag{7} $$


Thus, for example, the greatest common divisor of u = 7000 = 2^3 · 5^3 · 7 and
v = 4400 = 2^4 · 5^2 · 11 is 2^min(3,4) 5^min(3,2) 7^min(1,0) 11^min(0,1) = 2^3 · 5^2 = 200. The
least common multiple of the same two numbers is 2^4 · 5^3 · 7 · 11 = 154000.
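
Formulas (6) and (7) translate directly into a short Python sketch. The helper names `factorize` and `gcd_lcm_by_factoring` are invented for this illustration, and trial division is used only because it is the simplest way to obtain the exponents; it is practical for small inputs only.

```python
from collections import Counter

def factorize(n):
    """Return a Counter mapping each prime p to its exponent in n (n >= 1)."""
    exponents = Counter()
    p = 2
    while p * p <= n:
        while n % p == 0:
            exponents[p] += 1
            n //= p
        p += 1
    if n > 1:
        exponents[n] += 1       # whatever remains is a prime factor
    return exponents

def gcd_lcm_by_factoring(u, v):
    """Compute (gcd, lcm) of positive integers u, v via formulas (6) and (7)."""
    fu, fv = factorize(u), factorize(v)
    g = l = 1
    for p in set(fu) | set(fv):
        g *= p ** min(fu[p], fv[p])   # a missing prime has exponent 0
        l *= p ** max(fu[p], fv[p])
    return g, l

print(gcd_lcm_by_factoring(7000, 4400))
```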

From formulas (6) and (7) we can easily prove a number of basic identities 
concerning the gcd and the lcm: 


    gcd(u, v)w = gcd(uw, vw),                          if w > 0;       (8)
    lcm(u, v)w = lcm(uw, vw),                          if w > 0;       (9)
    u · v = gcd(u, v) · lcm(u, v),                     if u, v > 0;    (10)
    gcd(lcm(u, v), lcm(u, w)) = lcm(u, gcd(v, w));                     (11)
    lcm(gcd(u, v), gcd(u, w)) = gcd(u, lcm(v, w)).                     (12)


The latter two formulas are “distributive laws” analogous to the familiar identity 
uv + uw = u(v + w). Equation (10) reduces the calculation of gcd(u, v) to the
calculation of lcm(u, v), and conversely.
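
Identities (8) through (12) can be spot-checked numerically. The sketch below is not a proof; it merely tests every triple of small positive integers. The names `lcm_by_search` and `identities_hold` are invented here, and the least common multiple is found by direct search so that the check of (10) does not assume its own conclusion.

```python
from math import gcd

def lcm_by_search(u, v):
    """Smallest positive multiple of u that is also a multiple of v (u, v > 0)."""
    m = u
    while m % v != 0:
        m += u
    return m

def identities_hold(limit):
    """Check identities (8)-(12) for all 1 <= u, v, w <= limit."""
    values = range(1, limit + 1)
    for u in values:
        for v in values:
            l_uv = lcm_by_search(u, v)
            if u * v != gcd(u, v) * l_uv:                                   # (10)
                return False
            for w in values:
                if gcd(u, v) * w != gcd(u * w, v * w):                      # (8)
                    return False
                if l_uv * w != lcm_by_search(u * w, v * w):                 # (9)
                    return False
                if gcd(l_uv, lcm_by_search(u, w)) != lcm_by_search(u, gcd(v, w)):   # (11)
                    return False
                if lcm_by_search(gcd(u, v), gcd(u, w)) != gcd(u, lcm_by_search(v, w)):  # (12)
                    return False
    return True

print(identities_hold(10))
```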


Euclid’s algorithm. Although Eq. (6) is useful for theoretical purposes, it is 
generally no help for calculating a greatest common divisor in practice, because 
it requires that we first determine the canonical factorization of u and v. There 
is no known way to find the prime factors of an integer very rapidly (see Section 
4.5.4). But fortunately the greatest common divisor of two integers can be found 
efficiently without factoring, and in fact such a method was discovered more 
than 2250 years ago; it is Euclid’s algorithm, which we have already examined
in Sections 1.1 and 1.2.1. 

Euclid’s algorithm is found in Book 7, Propositions 1 and 2 of his Elements
(c. 300 B.C.), but it probably wasn’t his own invention. Some scholars believe 
that the method was known up to 200 years earlier, at least in its subtractive 
form, and it was almost certainly known to Eudoxus (c. 375 B.C.); see K. von 
Fritz, Ann. Math. (2) 46 (1945), 242-264. Aristotle (c. 330 B.C.) hinted at it 
in his Topics, 158b, 29-35. However, very little hard evidence about such early 
history has survived [see W. R. Knorr, The Evolution of the Euclidean Elements 
(Dordrecht: 1975)]. 

We might call Euclid’s method the granddaddy of all algorithms, because it 
is the oldest nontrivial algorithm that has survived to the present day. (The chief 
rival for this honor is perhaps the ancient Egyptian method for multiplication, 
which was based on doubling and adding, and which forms the basis for efficient 
calculation of nth powers as explained in Section 4.6.3. But the Egyptian 
manuscripts merely give examples that are not completely systematic, and the 
examples were certainly not stated systematically; the Egyptian method is there- 
fore not quite deserving of the name “algorithm.” Several ancient Babylonian 
methods, for doing such things as solving special sets of quadratic equations in 
two variables, are also known. Genuine algorithms are involved in this case, 
not just special solutions to the equations for certain input parameters; even 
though the Babylonians invariably presented each method in conjunction with an 
example worked with particular input data, they regularly explained the general 
procedure in the accompanying text. [See D. E. Knuth, CACM 15 (1972), 671- 
677; 19 (1976), 108.] Many of these Babylonian algorithms predate Euclid by 
1500 years, and they are the earliest known instances of written procedures for 
mathematics. But they do not have the stature of Euclid’s algorithm, since 
they do not involve iteration and since they have been superseded by modern 
algebraic methods.) 

In view of the importance of Euclid’s algorithm, for historical as well as 
practical reasons, let us now consider how Euclid himself treated it. Paraphrased 
into modern terminology, this is essentially what he wrote: 


Proposition. Given two positive integers, find their greatest common divisor. 


Let A and C be the two given positive integers; it is required to find their greatest 
common divisor. If C divides A, then C is a common divisor of C and A, since it 
also divides itself. And it clearly is in fact the greatest, since no greater number 
than C will divide C. 


But if C does not divide A, then continually subtract the lesser of the numbers 
A, C from the greater, until some number is left that divides the previous one. 
This will eventually happen, for if unity is left, it will divide the previous number. 


Now let E be the positive remainder of A divided by C; let F be the positive 
remainder of C divided by E; and suppose that F is a divisor of E. Since F
divides E and E divides C — F, F also divides C — F; but it also divides itself, 
so it divides C. And C divides A — E; therefore F also divides A — E. But it also 
divides E; therefore it divides A. Hence it is a common divisor of A and C. 


I now claim that it is also the greatest. For if F is not the greatest common divisor 
of A and C, some larger number will divide them both. Let such a number be G. 
Now since G divides C while C divides A − E, G divides A − E. G also divides the
whole of A, so it divides the remainder E. But E divides C − F; therefore G also
divides C − F. And G also divides the whole of C, so it divides the remainder F;
that is, a greater number divides a smaller one. This is impossible. 

Therefore no number greater than F will divide A and C, so F is their greatest 
common divisor. 

Corollary. This argument makes it evident that any number dividing two num- 
bers divides their greatest common divisor. Q.E.D. 


Euclid’s statements have been simplified here in one nontrivial respect: Greek 
mathematicians did not regard unity as a “divisor” of another positive integer. 
Two positive integers were either both equal to unity, or they were relatively 
prime, or they had a greatest common divisor. In fact, unity was not even 
considered to be a “number,” and zero was of course nonexistent. These rather 
awkward conventions made it necessary for Euclid to duplicate much of his 
discussion, and he gave two separate propositions that are each essentially like 
the one appearing here. 

In his discussion, Euclid first suggests subtracting the smaller of the two 
current numbers from the larger, repeatedly, until we get two numbers where one 
is a multiple of the other. But in the proof he really relies on taking the remainder 
of one number divided by another; and since he has no simple concept of zero, 
he cannot speak of the remainder when one number divides the other. It is 
reasonable to say that he imagines each division (not the individual subtractions) 
as a single step of the algorithm, and hence an “authentic” rendition of his 
algorithm can be phrased as follows: 


Algorithm E (Original Euclidean algorithm). Given two integers A and C 
greater than unity, this algorithm finds their greatest common divisor. 


E1. [Is A divisible by C?] If C divides A, the algorithm terminates with C as 
the answer. 


E2. [Replace A by remainder.] If A mod C is equal to unity, the given numbers 
were relatively prime, so the algorithm terminates. Otherwise replace the 
pair of values (A, C) by (C, A mod C) and return to step E1. ∎
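
For comparison with the modern formulation, Algorithm E can be transcribed into Python almost word for word. The function name `euclid_original` is an assumption of this sketch, and the value 1 is returned where Euclid would say that the given numbers are relatively prime.

```python
def euclid_original(a, c):
    """Algorithm E for integers a, c greater than unity."""
    if a <= 1 or c <= 1:
        raise ValueError("Algorithm E requires both inputs to exceed unity")
    while True:
        # E1. [Is A divisible by C?]
        if a % c == 0:
            return c
        # E2. [Replace A by remainder.]
        if a % c == 1:
            return 1          # the given numbers were relatively prime
        a, c = c, a % c

print(euclid_original(40902, 24140))
```

Because a remainder of zero or unity ends the computation, neither value ever becomes one of the two current numbers.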


Euclid’s “proof” quoted above is especially interesting because it is not really 
a proof at all! He verifies the result of the algorithm only if step E1 is performed 
once or thrice. Surely he must have realized that step E1 could take place more 
than three times, although he made no mention of such a possibility. Not having 
the notion of a proof by mathematical induction, he could only give a proof for a 
finite number of cases. (In fact, he often proved only the case n = 3 of a theorem 
that he wanted to establish for general n.) Although Euclid is justly famous for 
the great advances he made in the art of logical deduction, techniques for giving 
valid proofs by induction were not discovered until many centuries later, and the 
crucial ideas for proving the validity of algorithms are only now becoming really 
clear. (See Section 1.2.1 for a complete proof of Euclid’s algorithm, together 
with a short discussion of general proof procedures for algorithms.) 

It is worth noting that this algorithm for finding the greatest common divisor
was chosen by Euclid to be the very first step in his development of the theory 
of numbers. The same order of presentation is still in use today in modern 
textbooks. Euclid also gave a method (Proposition 34) to find the least common 
multiple of two integers u and v, namely to divide u by gcd(u, v) and to multiply 
the result by v; this is equivalent to Eq. (10). 

If we avoid Euclid’s bias against the numbers 0 and 1, we can reformulate 
Algorithm E in the following way. 


Algorithm A (Modern Euclidean algorithm). Given nonnegative integers u
and v, this algorithm finds their greatest common divisor. (Note: The greatest
common divisor of arbitrary integers u and v may be obtained by applying this
algorithm to |u| and |v|, because of Eqs. (2) and (3).)

A1. [v = 0?] If v = 0, the algorithm terminates with u as the answer.

A2. [Take u mod v.] Set r ← u mod v, u ← v, v ← r, and return to A1. (The
operations of this step decrease the value of v, but they leave gcd(u, v)
unchanged.) ∎


For example, we may calculate gcd(40902, 24140) as follows: 

    gcd(40902, 24140) = gcd(24140, 16762) = gcd(16762, 7378)
                      = gcd(7378, 2006) = gcd(2006, 1360) = gcd(1360, 646)
                      = gcd(646, 68) = gcd(68, 34) = gcd(34, 0) = 34.

The validity of Algorithm A follows readily from Eq. (4) and the fact that

    gcd(u, v) = gcd(v, u − qv),    (13)


if q is any integer. Equation (13) holds because any common divisor of u and v 
is a divisor of both v and u — qv, and, conversely, any common divisor of v and 
u − qv must divide both u and v.
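
In a high-level language Algorithm A takes only a few lines. The Python sketch below also records the successive pairs (u, v), so that the chain of equalities in the example can be reproduced; the names `euclid_steps` and `gcd_euclid` are assumptions made for this illustration.

```python
def euclid_steps(u, v):
    """Yield the successive pairs (u, v) visited by Algorithm A."""
    u, v = abs(u), abs(v)      # Eqs. (2) and (3) allow arbitrary signs
    yield (u, v)
    while v != 0:              # A1. [v = 0?]
        u, v = v, u % v        # A2. [Take u mod v.]
        yield (u, v)

def gcd_euclid(u, v):
    """Greatest common divisor by Algorithm A; gcd(0, 0) = 0."""
    return list(euclid_steps(u, v))[-1][0]

for pair in euclid_steps(40902, 24140):
    print(pair)
```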

The following MIX program illustrates the fact that Algorithm A can easily 
be implemented on a computer: 


Program A (Euclid’s algorithm). Assume that u and v are single-precision,
nonnegative integers, stored respectively in locations U and V; this program puts
gcd(u, v) into rA.

          LDX   U      1      rX ← u.
          JMP   2F     1
    1H    STX   V      T      v ← rX.
          SRAX  5      T      rAX ← rA.
          DIV   V      T      rX ← rAX mod v.
    2H    LDA   V      1+T    rA ← v.
          JXNZ  1B     1+T    Done if rX = 0. ∎

The running time for this program is 19T + 6 cycles, where T is the number 
of divisions performed. The discussion in Section 4.5.3 shows that we may take 
T = 0.842766 ln N + 0.06 as an approximate average value, when u and v are
independently and uniformly distributed in the range 1 ≤ u, v ≤ N.

A binary method. Since Euclid’s patriarchal algorithm has been used for so
many centuries, it is rather surprising that it might not be the best way to find 
the greatest common divisor after all. A quite different gcd algorithm, primarily 
suited to binary arithmetic, was devised by Josef Stein in 1961 [see J. Comp. 
Phys. 1 (1967), 397-405]. This new algorithm requires no division instruction; it 
relies solely on the operations of subtraction, parity testing, and halving of even 
numbers (which corresponds to a right shift in binary notation). 

The binary gcd algorithm is based on four simple facts about positive inte- 
gers u and v: 

a) If u and v are both even, then gcd(u, v) = 2 gcd(u/2, v/2). [See Eq. (8).]

b) If u is even and v is odd, then gcd(u, v) = gcd(u/2, v). [See Eq. (6).]

c) As in Euclid’s algorithm, gcd(u, v) = gcd(u − v, v). [See Eqs. (13), (2).]

d) If u and v are both odd, then u − v is even, and |u − v| < max(u, v).

Algorithm B (Binary gcd algorithm). Given positive integers u and v, this
algorithm finds their greatest common divisor.

B1. [Find power of 2.] Set k ← 0, and then repeatedly set k ← k + 1, u ← u/2,
v ← v/2, zero or more times until u and v are not both even.
B2. [Initialize.] (Now the original values of u and v have been divided by 2^k,
and at least one of their present values is odd.) If u is odd, set t ← −v and
go to B4. Otherwise set t ← u.
B3. [Halve t.] (At this point, t is even, and nonzero.) Set t ← t/2.
B4. [Is t even?] If t is even, go back to B3.
B5. [Reset max(u, v).] If t > 0, set u ← t; otherwise set v ← −t. (The larger of
u and v has been replaced by |t|, except perhaps during the first time this
step is performed.)
B6. [Subtract.] Set t ← u − v. If t ≠ 0, go back to B3. Otherwise the algorithm
terminates with u · 2^k as the output. ∎

As an example of Algorithm B, let us consider u = 40902, v = 24140, the
same numbers we used when trying out Euclid’s algorithm. Step B1 sets k ← 1,
u ← 20451, v ← 12070. Then t is set to −12070, and replaced by −6035; then v
is replaced by 6035, and the computation proceeds as follows:

        u       v     t
    20451    6035     +14416, +7208, +3604, +1802, +901;
      901    6035     −5134, −2567;
      901    2567     −1666, −833;
      901     833     +68, +34, +17;
       17     833     −816, −408, −204, −102, −51;
       17      51     −34, −17;
       17      17     0.

The answer is 17 · 2^1 = 34. A few more iterations were necessary here than
we needed with Algorithm A, but each iteration was somewhat simpler since no 
division steps were used. 
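
Steps B1 through B6 can be written with structured loops instead of jumps. The following Python sketch is one such rendering; the function name `gcd_binary` is an assumption, and the signed variable t plays the same role as in the algorithm, recording by its sign whether u or v is to be replaced.

```python
def gcd_binary(u, v):
    """Binary gcd of positive integers u and v, following Algorithm B."""
    # B1. [Find power of 2.]
    k = 0
    while u % 2 == 0 and v % 2 == 0:
        u //= 2
        v //= 2
        k += 1
    # B2. [Initialize.]  At least one of u, v is now odd.
    t = -v if u % 2 == 1 else u
    while t != 0:
        # B3, B4. [Halve t while it is even.]
        while t % 2 == 0:
            t //= 2            # exact division, also for negative even t
        # B5. [Reset max(u, v).]
        if t > 0:
            u = t
        else:
            v = -t
        # B6. [Subtract.]
        t = u - v
    return u << k              # u * 2^k

print(gcd_binary(40902, 24140))
```

Only subtraction, parity tests, and shifts occur in the main loop, which is the point of the method.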

[Figure: flowchart of Algorithm B, with boxes B1. Find power of 2; B2. Initialize; B3. Halve t; B5. Reset max(u, v); and the remaining steps.]

Fig. 9. Binary algorithm for the greatest common divisor.


A MIX program for Algorithm B requires a bit more code than for Algo- 
rithm A, but the steps are elementary. In order to make such a program fairly 
typical of a binary computer’s representation of Algorithm B, let us assume that 
MIX is extended to include the following operators: 


• SLB (shift left AX binary). C = 6; F = 6.
The contents of registers A and X are “shifted left” M binary places; that is,
|rAX| ← |2^M rAX| mod B^10, where B is the byte size. (As with all MIX shift
commands, the signs of rA and rX are not affected.)

• SRB (shift right AX binary). C = 6; F = 7.
The contents of registers A and X are “shifted right” M binary places; that is,
|rAX| ← ⌊|rAX|/2^M⌋.

• JAE, JAO (jump A even, jump A odd). C = 40; F = 6, 7, respectively.
A JMP occurs if rA is even or odd, respectively.

• JXE, JXO (jump X even, jump X odd). C = 47; F = 6, 7, respectively.
Analogous to JAE, JAO.


Program B (Binary gcd algorithm). Assume that u and v are single-precision
positive integers, stored respectively in locations U and V; this program uses
Algorithm B to put gcd(u, v) into rA. Register assignments: rA = t, rI1 = k.

    ABS   EQU   1:5
    B1    ENT1  0         1        B1. Find power of 2.
          LDX   U         1        rX ← u.
          LDAN  V         1        rA ← −v.
          JMP   1F        1
    2H    SRB   1         A        Halve rA, rX.
          INC1  1         A        k ← k + 1.
          STX   U         A        u ← u/2.
          STA   V(ABS)    A        v ← v/2.
    1H    JXO   B4        1+A      To B4 with t ← −v if u is odd.
    B2    JAE   2B        B+A      B2. Initialize.
          LDA   U         B        t ← u.
    B3    SRB   1         D        B3. Halve t.
    B4    JAE   B3        1−B+D    B4. Is t even?
    B5    JAN   1F        C        B5. Reset max(u, v).
          STA   U         E        If t > 0, set u ← t.
          SUB   V         E        t ← u − v.
          JMP   2F        E
    1H    STA   V(ABS)    C−E      If t < 0, set v ← −t.
    B6    ADD   U         C−E      B6. Subtract.
    2H    JANZ  B3        C        To B3 if t ≠ 0.
          LDA   U         1        rA ← u.
          ENTX  0         1        rX ← 0.
          SLB   0,1       1        rA ← 2^k · rA. ∎


The running time of this program is 
9A+2B+6C+3D+E+13 


units, where A = k, B = 1 if t ← u in step B2 (otherwise B = 0), C is the
number of subtraction steps, D is the number of halvings in step B3, and E is
the number of times t > 0 in step B5. Calculations discussed later in this section
imply that we may take A = 1/3, B = 1/3, C = 0.71N − 0.5, D = 1.41N − 2.7, and
E = 0.35N − 0.4 as average values for these quantities, assuming random inputs
u and v in the range 1 ≤ u, v < 2^N. The total running time is therefore about
8.8N + 5.2 cycles, compared to about 11.1N + 7.1 for Program A under the same
assumptions. The worst possible running time for u and v in this range occurs
when A = 0, B = 1, C = N, D = 2N − 2, E = N − 1; this amounts to 13N + 8
cycles. (The corresponding value for Program A is 26.8N + 19.)

Thus the greater speed of the iterations in Program B, due to the simplicity 
of the operations, compensates for the greater number of iterations required. We 
have found that the binary algorithm is about 20 percent faster than Euclid’s 
algorithm on the MIX computer. Of course, the situation may be different 
on other computers, and in any event both programs are quite efficient; but 
it appears that not even a procedure as venerable as Euclid’s algorithm can 
withstand progress. 

The binary gcd algorithm itself might have a distinguished pedigree, since 
it may well have been known in ancient China. Chapter 1, Section 6 of a classic 
text called Chiu Chang Suan Shu, the “Nine Chapters on Arithmetic” (c. 1st
century A.D.), gives the following method for reducing a fraction to lowest terms: 


If halving is possible, take half. 

Otherwise write down the denominator and the numerator, and subtract the 

smaller from the greater. 

Repeat until both numbers are equal. 

Simplify with this common value. 
If the repeat instruction means to go back to the halving step instead of to 
repeat the subtraction step—this point isn’t clear — the method is essentially 
Algorithm B. [See Y. Mikami, The Development of Mathematics in China 
and Japan (Leipzig: 1913), 11; K. Vogel, Neun Bücher arithmetischer Technik
(Braunschweig: Vieweg, 1968), 8.] 

V. C. Harris [Fibonacci Quarterly 8 (1970), 102-103; see also V. A. Le- 
besgue, J. Math. Pures Appl. 12 (1847), 497-520] has suggested an interesting 
cross between Euclid’s algorithm and the binary algorithm. If u and v are odd, 
with u > v > 0, we can always write 


    u = qv ± r

where 0 ≤ r < v and r is even; if r ≠ 0 we set r ← r/2 until r is odd, then set
u ← v, v ← r and repeat the process. In subsequent iterations, q ≥ 3.
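
A possible reading of this hybrid method in Python is sketched below. The name `gcd_harris` is invented for the illustration, and the way the even remainder is obtained (taking u mod v, and replacing an odd remainder r by v − r, which corresponds to the minus sign with q increased by one) is an assumption about the details rather than a transcription of the cited papers.

```python
def gcd_harris(u, v):
    """gcd of odd positive integers u and v by the Euclid/binary hybrid."""
    if u % 2 == 0 or v % 2 == 0 or u <= 0 or v <= 0:
        raise ValueError("both inputs must be odd and positive")
    if u < v:
        u, v = v, u
    while True:
        r = u % v
        if r % 2 == 1:
            r = v - r          # u = (q + 1)v - (v - r); v - r is even since v, r are odd
        if r == 0:
            return v
        while r % 2 == 0:      # v is odd, so factors of 2 in r cannot belong to the gcd
            r //= 2
        u, v = v, r

print(gcd_harris(20451, 6035))
```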


Extensions. We can extend the methods used to calculate gcd(u, v) in order to 
solve some slightly more difficult problems. For example, assume that we want 
to compute the greatest common divisor of n integers u₁, u₂, ..., u_n.

One way to calculate gcd(u₁, u₂, ..., u_n), assuming that the u’s are all
nonnegative, is to extend Euclid’s algorithm in the following way: If all u_j are
zero, the greatest common divisor is taken to be zero; otherwise if only one u_j is
nonzero, it is the greatest common divisor; otherwise replace u_k by u_k mod u_j for
all k ≠ j, where u_j is the minimum of the nonzero u’s, and repeat the process.

The algorithm sketched in the preceding paragraph is a natural generaliza- 
tion of Euclid’s method, and it can be justified in a similar manner. But there 
is a simpler method available, based on the easily verified identity 


    gcd(u₁, u₂, ..., u_n) = gcd(u₁, gcd(u₂, ..., u_n)).    (14)

To calculate gcd(u₁, u₂, ..., u_n), we may therefore proceed as follows:


Algorithm C (Greatest common divisor of n integers). Given integers u₁, u₂,
..., u_n, where n ≥ 1, this algorithm computes their greatest common divisor,
using an algorithm for the case n = 2 as a subroutine.

C1. Set d ← u_n, k ← n − 1.

C2. If d ≠ 1 and k > 0, set d ← gcd(u_k, d) and k ← k − 1 and repeat this step.
Otherwise d = gcd(u₁, ..., u_n). ∎


This method reduces the calculation of gcd(u₁, ..., u_n) to repeated calculations
of the greatest common divisor of two numbers at a time. It makes use of the fact
that gcd(u₁, ..., u_k, 1) = 1; and this will be helpful, since we will already have
gcd(u_{n−1}, u_n) = 1 more than 60 percent of the time, if u_{n−1} and u_n are chosen
at random. In most cases the value of d will decrease rapidly during the first few 
stages of the calculation, and this will make the remainder of the computation 
quite fast. Here Euclid’s algorithm has an advantage over Algorithm B, because 
its running time is primarily governed by the value of min(u, v), while the running 
time for Algorithm B is primarily governed by max(u, v); it would be reasonable 
to perform one iteration of Euclid’s algorithm, replacing u by u mod v if u is 
much larger than v, and then to continue with Algorithm B. 
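
Algorithm C is short enough to state in Python directly. In this sketch the name `gcd_many` is an assumption, the inputs are taken to be nonnegative, and the library routine `math.gcd` stands in for whichever two-argument algorithm is preferred.

```python
from math import gcd

def gcd_many(us):
    """Greatest common divisor of a nonempty list of nonnegative integers."""
    d = us[-1]                 # C1. d <- u_n, k <- n - 1
    k = len(us) - 1
    while d != 1 and k > 0:    # C2. stop early once d = 1
        d = gcd(us[k - 1], d)  # us[k - 1] is u_k in the 1-based notation of the text
        k -= 1
    return d

print(gcd_many([10, 15, 7, 21]))
```

The early exit when d = 1 means that later (leftward) elements are never examined once the answer is known.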

The assertion that gcd(u_{n−1}, u_n) will be equal to unity more than 60 percent
of the time for random inputs is a consequence of the following well-known result 
of number theory: 


Theorem D. [G. Lejeune Dirichlet, Abhandlungen Königlich Preuß. Akad. 
Wiss. (1849), 69-83.] If u and v are integers chosen at random, the probability 
that gcd(u, v) = 1 is 6/π² ≈ .60793.


A precise formulation of this theorem, which defines carefully what is meant 
by being “chosen at random,” appears in exercise 10 with a rigorous proof. Let 
us content ourselves here with a heuristic argument that shows why the theorem 
is plausible. 

If we assume, without proof, the existence of a well-defined probability p
that u ⊥ v, then we can determine the probability that gcd(u, v) = d for any
positive integer d, because gcd(u, v) = d if and only if u is a multiple of d and
v is a multiple of d and u/d ⊥ v/d. Thus the probability that gcd(u, v) = d is
equal to 1/d times 1/d times p, namely p/d². Now let us sum these probabilities
over all possible values of d; we should get

$$ 1 = \sum_{d \ge 1} \frac{p}{d^2} = p\left(1 + \frac14 + \frac19 + \frac1{16} + \cdots\right). $$

Since the sum 1 + 1/4 + 1/9 + ··· = H_∞^(2) is equal to π²/6 by Eq. 1.2.7-(7), we need
p = 6/π² in order to make this equation come out right. ∎
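
The heuristic value can be compared with an exact count over a finite square. The Python sketch below counts the pairs 1 ≤ u, v ≤ n with gcd(u, v) = 1; the name `coprime_fraction` and the choice of n are assumptions of the illustration, and the agreement it shows is numerical evidence only, not the precise statement of exercise 10.

```python
from math import gcd, pi

def coprime_fraction(n):
    """Fraction of pairs (u, v) with 1 <= u, v <= n that are relatively prime."""
    count = sum(1 for u in range(1, n + 1)
                  for v in range(1, n + 1)
                  if gcd(u, v) == 1)
    return count / n**2

print(coprime_fraction(100), 6 / pi**2)
```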


Euclid’s algorithm can be extended in another important way: We can 
calculate integers u’ and v’ such that 

    uu’ + vv’ = gcd(u, v)    (15)

at the same time gcd(u, v) is being calculated. This extension of Euclid’s
algorithm can be described conveniently in vector notation:


Algorithm X (Extended Euclid’s algorithm). Given nonnegative integers u
and v, this algorithm determines a vector (u₁, u₂, u₃) such that uu₁ + vu₂ =
u₃ = gcd(u, v). The computation makes use of auxiliary vectors (v₁, v₂, v₃),
(t₁, t₂, t₃); all vectors are manipulated in such a way that the relations

    ut₁ + vt₂ = t₃,    uu₁ + vu₂ = u₃,    uv₁ + vv₂ = v₃    (16)

hold throughout the calculation.

X1. [Initialize.] Set (u₁, u₂, u₃) ← (1, 0, u), (v₁, v₂, v₃) ← (0, 1, v).
X2. [Is v₃ = 0?] If v₃ = 0, the algorithm terminates.
X3. [Divide, subtract.] Set q ← ⌊u₃/v₃⌋, and then set

    (t₁, t₂, t₃) ← (u₁, u₂, u₃) − (v₁, v₂, v₃)q,
    (u₁, u₂, u₃) ← (v₁, v₂, v₃),    (v₁, v₂, v₃) ← (t₁, t₂, t₃).

Return to step X2. ∎

For example, let u = 40902, v = 24140. At step X2 we have

     q     u₁      u₂      u₃      v₁      v₂      v₃
     -      1       0   40902       0       1   24140
     1      0       1   24140       1      −1   16762
     1      1      −1   16762      −1       2    7378
     2     −1       2    7378       3      −5    2006
     3      3      −5    2006     −10      17    1360
     1    −10      17    1360      13     −22     646
     2     13     −22     646     −36      61      68
     9    −36      61      68     337    −571      34
     2    337    −571      34    −710    1203       0

The solution is therefore 337 - 40902 — 571 - 24140 = 34 = gcd(40902, 24140). 
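
Algorithm X carries over to Python almost line by line. In the sketch below the function name `extended_euclid` is an assumption; the three vectors are kept as separate variables so that the relations (16) can be read off directly.

```python
def extended_euclid(u, v):
    """Return (u1, u2, u3) with u*u1 + v*u2 == u3 == gcd(u, v), for u, v >= 0."""
    # X1. [Initialize.]
    u1, u2, u3 = 1, 0, u
    v1, v2, v3 = 0, 1, v
    # X2. [Is v3 = 0?]
    while v3 != 0:
        # X3. [Divide, subtract.]
        q = u3 // v3
        t1, t2, t3 = u1 - v1 * q, u2 - v2 * q, u3 - v3 * q
        u1, u2, u3 = v1, v2, v3
        v1, v2, v3 = t1, t2, t3
    return u1, u2, u3

a, b, g = extended_euclid(40902, 24140)
print(a, b, g, a * 40902 + b * 24140)
```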


Algorithm X can be traced to the Aryabhatiya (A.D. 499) by Aryabhata of 
northern India. His description was rather cryptic, but later commentators such 
as Bhaskara I in the seventh century clarified the rule, which was called kuttaka 
(“the pulverizer”). [See B. Datta and A. N. Singh, History of Hindu Mathematics 
2 (Lahore: Motilal Banarsi Das, 1938), 89-116.] Its validity follows from (16) 
and the fact that the algorithm is identical to Algorithm A with respect to 
its manipulation of u3 and v3; a detailed proof of Algorithm X is discussed in 
Section 1.2.1. Gordon H. Bradley has observed that we can avoid a good deal 
of the calculation in Algorithm X by suppressing u2, v2, and t2; then u2 can be 
determined afterwards using the relation uu1 + vu2 = u3. 

Exercise 15 shows that the values of |u1|, |u2|, |v1|, and |v2| remain bounded 
by the size of the inputs u and v. Algorithm B, which computes the greatest 
common divisor using properties of binary notation, can be extended in a similar 
way; see exercise 39. For some instructive extensions to Algorithm X, see 
exercises 18 and 19 in Section 4.6.1. 


The ideas underlying Euclid’s algorithm can also be applied to find a general 
solution in integers of any set of linear equations with integer coefficients. For 
example, suppose that we want to find all integers w, x, y, z that satisfy the two 
equations 

10w + 3x + 3y + 8z = 1, (17) 
6w − 7x − 5z = 2. (18) 

We can introduce a new variable 

⌊10/3⌋w + ⌊3/3⌋x + ⌊3/3⌋y + ⌊8/3⌋z = 3w + x + y + 2z = t1, 

and use it to eliminate y; Eq. (17) becomes 

(10 mod 3)w + (3 mod 3)x + 3t1 + (8 mod 3)z = w + 3t1 + 2z = 1, (19) 

and Eq. (18) remains unchanged. The new equation (19) may be used to 
eliminate w, and (18) becomes 

6(1 − 3t1 − 2z) − 7x − 5z = 2; 

that is, 

7x + 18t1 + 17z = 4. (20) 

Now as before we introduce a new variable 

x + 2t1 + 2z = t2 

and eliminate x from (20): 

7t2 + 4t1 + 3z = 4. (21) 

Another new variable can be introduced in the same fashion, in order to eliminate 
the variable z, which has the smallest coefficient: 

2t2 + t1 + z = t3. 

Eliminating z from (21) yields 

t2 + t1 + 3t3 = 4, (22) 

and this equation, finally, can be used to eliminate t2. We are left with two 
independent variables, t1 and t3; substituting back for the original variables, we 
obtain the general solution 

w = 17 − 5t1 − 14t3, 
x = 20 − 5t1 − 17t3, 
y = −55 + 19t1 + 45t3, 
z = −8 + t1 + 7t3. (23) 


In other words, all integer solutions (w,x,y,z) to the original equations (17) 
and (18) are obtained from (23) by letting tı and t3 independently run through 
all integers. 
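The general solution can be checked mechanically by substituting it back into (17) and (18). The sketch below writes out the coefficients of (23) explicitly, namely w = 17 − 5t1 − 14t3, x = 20 − 5t1 − 17t3, y = −55 + 19t1 + 45t3, z = −8 + t1 + 7t3; the function names are hypothetical helpers introduced for this illustration.

```python
def general_solution(t1, t3):
    """The parametrized solution (23) for given integers t1 and t3."""
    w = 17 - 5 * t1 - 14 * t3
    x = 20 - 5 * t1 - 17 * t3
    y = -55 + 19 * t1 + 45 * t3
    z = -8 + t1 + 7 * t3
    return w, x, y, z

def satisfies_17_and_18(w, x, y, z):
    """True if (w, x, y, z) solves equations (17) and (18)."""
    return 10 * w + 3 * x + 3 * y + 8 * z == 1 and 6 * w - 7 * x - 5 * z == 2

if __name__ == "__main__":
    ok = all(satisfies_17_and_18(*general_solution(t1, t3))
             for t1 in range(-10, 11) for t3 in range(-10, 11))
    print("all sampled parameter values give solutions:", ok)
```

This only confirms that every sampled (t1, t3) gives a solution; that no solutions are missed follows from the elimination argument in the text.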

The general method that has just been illustrated is based on the following 
procedure: Find a nonzero coefficient c of smallest absolute value in the system 
of equations. Suppose that this coefficient appears in an equation having the 
form 


c x0 + c1 x1 + ··· + ck xk = d; (24) 

and assume for simplicity that c > 0. If c = 1, use this equation to eliminate 
the variable x0 from the other equations remaining in the system; then repeat 
the procedure on the remaining equations. (If no more equations remain, the 
computation stops, and a general solution in terms of the variables not yet 
eliminated has essentially been obtained.) If c > 1, then if c1 mod c = ··· = 
ck mod c = 0 check that d mod c = 0, otherwise there is no integer solution; then 
divide both sides of (24) by c and eliminate x0 as in the case c = 1. Finally, 
if c > 1 and not all of c1 mod c, ..., ck mod c are zero, then introduce a new 
variable 

⌊c/c⌋x0 + ⌊c1/c⌋x1 + ··· + ⌊ck/c⌋xk = t; (25) 

eliminate the variable x0 from the other equations, in favor of t, and replace the 
original equation (24) by 

ct + (c1 mod c)x1 + ··· + (ck mod c)xk = d. (26) 


(See (19) and (21) in the example above.) 

This process must terminate, since each step reduces either the number of 
equations or the size of the smallest nonzero coefficient in the system. When this 
procedure is applied to the equation ux + vy = 1, for specific integers u and v, 
it runs through essentially the steps of Algorithm X. 

The transformation-of-variables procedure just explained is a simple and 
straightforward way to solve linear equations when the variables are allowed 
to take on integer values only, but it isn’t the best method available for this 
problem. Substantial refinements are possible, but beyond the scope of this 
book. [See Henri Cohen, A Course in Computational Algebraic Number Theory 
(New York: Springer, 1993), Chapter 2.] 


Variants of Euclid’s algorithm can be used also with Gaussian integers u+iu’ 
and in certain other quadratic number fields. See, for example, A. Hurwitz, Acta 
Math. 11 (1887), 187-200; E. Kaltofen and H. Rolletschek, Math. Comp. 53 
(1989), 697-720; A. Knopfmacher and J. Knopfmacher, BIT 31 (1991), 286- 
292. 


High-precision calculation. If u and v are very large integers, requiring a 
multiple-precision representation, the binary method (Algorithm B) is a simple 
and fairly efficient means of calculating their greatest common divisor, since it 
involves only subtractions and shifting. 

By contrast, Euclid’s algorithm seems much less attractive, since step A2 
requires a multiple-precision division of u by v. But this difficulty is not really 
as bad as it seems, since we will prove in Section 4.5.3 that the quotient ⌊u/v⌋ is 
almost always very small. For example, assuming random inputs, the quotient 
⌊u/v⌋ will be less than 1000 approximately 99.856 percent of the time. Therefore 
it is almost always possible to find ⌊u/v⌋ and (u mod v) using single-precision 
calculations, together with the comparatively simple operation of calculating 
u − qv where q is a single-precision number. Furthermore, if it does turn out 
that u is much larger than v (for instance, the initial input data might have this 
form), we don’t really mind having a large quotient q, since Euclid’s algorithm 
makes a great deal of progress when it replaces u by u mod v in such a case. 
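The 99.856 percent figure can be reproduced numerically. The sketch below assumes the limiting distribution of partial quotients from the theory of continued fractions, under which a quotient is at least a with probability about lg(1 + 1/a); that formula is quoted here as an assumption, since it is not derived in this section, and the function name is illustrative.

```python
from math import log2

def prob_quotient_at_least(a):
    """Approximate probability that a quotient in Euclid's algorithm is >= a,
    assuming the limiting (Gauss-Kuzmin) distribution lg(1 + 1/a)."""
    return log2(1 + 1 / a)

if __name__ == "__main__":
    p_small = 1 - prob_quotient_at_least(1000)
    print(f"quotient < 1000 with probability about {100 * p_small:.3f} percent")
```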

A significant improvement in the speed of Euclid’s algorithm when high- 
precision numbers are involved can be achieved by using a method due to D. H. 
Lehmer [AMM 45 (1938), 227-233]. Working only with the leading digits of 
large numbers, it is possible to do most of the calculations with single-precision 
arithmetic, and to make a substantial reduction in the number of multiple- 
precision operations involved. The idea is to save time by doing a “virtual” 
calculation instead of the actual one. 

For example, let us consider the pair of eight-digit numbers u = 27182818, 
v = 10000000, assuming that we are using a machine with only four-digit words. 
Let u’ = 2718, v’ = 1001, u” = 2719, v” = 1000; then u’/v’ and u”/v” are 
approximations to u/v, with 

u’/v’ < u/v < u”/v”. (27) 


The ratio u/v determines the sequence of quotients obtained in Euclid’s algo- 
rithm. If we perform Euclid’s algorithm simultaneously on the single-precision 
values (u’, v’) and (u”, v”) until we get a different quotient, it is not difficult to 
see that the same sequence of quotients would have appeared to this point if 
we had worked with the multiple-precision numbers (u,v). Thus, consider what 
happens when Euclid’s algorithm is applied to (u’, v’) and to (u”, v”): 


u’       v’       q’       u”       v”       q” 
2718     1001     2        2719     1000     2 
1001     716      1        1000     719      1 
716      285      2        719      281      2 
285      146      1        281      157      1 
146      139      1        157      124      1 
139      7        19       124      33       3 


The first five quotients are the same in both cases, so they must be the true ones. 
But on the sixth step we find that q’ ≠ q”, so the single-precision calculations 
are suspended. We have gained the knowledge that the calculation would have 
proceeded as follows if we had been working with the original multiple-precision 
numbers: 


u                 v                 q 
u0                v0                2 
v0                u0 − 2v0          1 
u0 − 2v0          −u0 + 3v0         2          (28) 
−u0 + 3v0         3u0 − 8v0         1 
3u0 − 8v0         −4u0 + 11v0       1 
−4u0 + 11v0       7u0 − 19v0        ? 


(The next quotient lies somewhere between 3 and 19.) No matter how many 
digits are in u and v, the first five steps of Euclid’s algorithm would be the same 
as (28), so long as (27) holds. We can therefore avoid the multiple-precision 
operations of the first five steps, and replace them all by a multiple-precision 
calculation of −4u0 + 11v0 and 7u0 − 19v0. In this case we obtain u = 1268728, 
v = 279726; the calculation can now continue in a similar manner with u’ = 1268, 
v’ = 280, u” = 1269, v” = 279, etc. If we had a larger accumulator, more steps 
could be done by single-precision calculations. Our example showed that only 
five cycles of Euclid’s algorithm were combined into one multiple step, but with 
(say) a word size of 10 digits we could do about twelve cycles at a time. Results 
proved in Section 4.5.3 imply that the number of multiple-precision cycles that 
can be replaced at each iteration is essentially proportional to the number of 
digits used in the single-precision calculations. 
Lehmer’s method can be formulated as follows: 
Algorithm L (Euclid’s algorithm for large numbers). Let u and v be nonnegative 
integers, with u ≥ v, represented in multiple precision. This algorithm computes 
the greatest common divisor of u and v, making use of auxiliary single-precision 
p-digit variables û, v̂, A, B, C, D, T, q, and auxiliary multiple-precision variables 
t and w. 

L1. [Initialize.] If v is small enough to be represented as a single-precision 
value, calculate gcd(u, v) by Algorithm A and terminate the computation. 
Otherwise, let û be the p leading digits of u, and let v̂ be the corresponding 
digits of v; in other words, if radix-b notation is being used, û ← ⌊u/b^k⌋ and 
v̂ ← ⌊v/b^k⌋, where k is as small as possible consistent with the condition 
û < b^p. 

Set A ← 1, B ← 0, C ← 0, D ← 1. (These variables represent the 
coefficients in (28), where 

u = Au0 + Bv0, and v = Cu0 + Dv0, (29) 

in the equivalent actions of Algorithm A on multiple-precision numbers. We 
also have 

u’ = û + B, v’ = v̂ + D, u” = û + A, v” = v̂ + C (30) 

in terms of the notation in the example worked above.) 

L2. [Test quotient.] Set q ← ⌊(û + A)/(v̂ + C)⌋. If q ≠ ⌊(û + B)/(v̂ + D)⌋, 
go to step L4. (This step tests if q’ ≠ q”, in the notation of the example 
above. Single-precision overflow can occur in special circumstances during 
the computation in this step, but only when û = b^p − 1 and A = 1 or when 
v̂ = b^p − 1 and D = 1; the conditions 

0 ≤ û + A ≤ b^p, 0 ≤ v̂ + C < b^p, 
0 ≤ û + B < b^p, 0 ≤ v̂ + D ≤ b^p (31) 

will always hold, because of (30). It is possible to have v̂ + C = 0 or v̂ + D = 0, 
but not both simultaneously; therefore division by zero in this step is taken 
to mean “Go directly to L4.”) 

L3. [Emulate Euclid.] Set T ← A − qC, A ← C, C ← T, T ← B − qD, B ← D, 
D ← T, T ← û − qv̂, û ← v̂, v̂ ← T, and go back to step L2. (These 
single-precision calculations are the equivalent of multiple-precision operations, as 
in (28), under the conventions of (29).) 

L4. [Multiprecision step.] If B = 0, set t ← u mod v, u ← v, v ← t, using 
multiple-precision division. (This happens only if the single-precision 
operations cannot simulate any of the multiple-precision ones. It implies that 
Euclid’s algorithm requires a very large quotient, and this is an extremely 
rare occurrence.) Otherwise, set t ← Au, t ← t + Bv, w ← Cu, w ← w + Dv, 
u ← t, v ← w (using straightforward multiple-precision operations). Go 
back to step L1. ∎ 
The values of A, B, C, D remain as single-precision numbers throughout 
this calculation, because of (31). 
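A compact way to experiment with Algorithm L is to let Python's built-in integers play the role of the multiple-precision numbers and to treat values below b^p as "single precision". The following is a sketch under those assumptions, not a tuned implementation; the name `lehmer_gcd` and the default parameters p = 4, b = 10 (matching the four-digit example) are choices made here. The variables `uh` and `vh` stand for the leading-digit values û and v̂.

```python
from math import gcd

def lehmer_gcd(u, v, p=4, b=10):
    """Sketch of Algorithm L with p-digit radix-b 'single precision'."""
    if u < v:
        u, v = v, u
    bound = b ** p
    while v >= bound:
        # L1. Leading p digits of u, and the corresponding digits of v.
        scale = 1
        while u // scale >= bound:
            scale *= b
        uh, vh = u // scale, v // scale
        A, B, C, D = 1, 0, 0, 1
        while True:
            # L2. Test quotient; a zero divisor means "go directly to L4".
            if vh + C == 0 or vh + D == 0:
                break
            q = (uh + A) // (vh + C)
            if q != (uh + B) // (vh + D):
                break
            # L3. Emulate Euclid on the single-precision values.
            A, C = C, A - q * C
            B, D = D, B - q * D
            uh, vh = vh, uh - q * vh
        # L4. Multiprecision step.
        if B == 0:
            u, v = v, u % v
        else:
            u, v = A * u + B * v, C * u + D * v
    while v:                      # Algorithm A once v is single precision
        u, v = v, u % v
    return u

if __name__ == "__main__":
    print(lehmer_gcd(27182818, 10000000), gcd(27182818, 10000000))
```

In this sketch the savings are only notional, because Python integers are already arbitrary precision; the point is to show how the steps fit together and to check the result against `math.gcd`.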

Algorithm L requires a somewhat more complicated program than Algo- 
rithm B, but with large numbers it will be faster on many computers. The 
binary technique of Algorithm B can, however, be speeded up in a similar way 
(see exercise 38), to the point where it continues to win. Algorithm L has the 
advantage that it determines the sequence of quotients obtained in Euclid’s algo- 
rithm, and this sequence has numerous applications (see, for example, exercises 
43, 47, 49, and 51 in Section 4.5.3). See also exercise 4.5.3-46. 


* Analysis of the binary algorithm. Let us conclude this section by studying 
the running time of Algorithm B, in order to justify the formulas stated earlier. 
An exact determination of Algorithm B’s behavior appears to be exceedingly 
difficult to derive, but we can begin to study it by means of an approximate 
model. Suppose that u and v are odd numbers, with u > v and 


⌊lg u⌋ = m, ⌊lg v⌋ = n. (32) 


(Thus, u is an (m + 1)-bit number, and v is an (n + 1)-bit number.) Consider 
a subtract-and-shift cycle of Algorithm B, namely an operation that starts at 
step B6 and then stops after step B5 is finished. Every subtract-and-shift cycle 
with u > v forms u — v and shifts this quantity right until obtaining an odd 
number u’ that replaces u. Under random conditions, we would expect to have 
u’ = (u — v)/2 about one-half of the time, u’ = (u — v)/4 about one-fourth of 
the time, u’ = (u — v)/8 about one-eighth of the time, and so on. We have 


⌊lg u’⌋ = m − k − r, (33) 


where k is the number of places that u − v is shifted right, and where r is 
⌊lg u⌋ − ⌊lg(u − v)⌋, the number of bits lost at the left during the subtraction of 
v from u. Notice that r ≤ 1 when m ≥ n + 2, and r ≥ 1 when m = n. 

The interaction between k and r is quite messy (see exercise 20), but Richard 
Brent discovered a nice way to analyze the approximate behavior by assuming 
that u and v are large enough that a continuous distribution describes the ratio 
v/u, while k varies discretely. [See Algorithms and Complexity, edited by J. F. 
Traub (New York: Academic Press, 1976), 321-355.] Let us assume that u and v 
are large integers that are essentially random, except that they are odd and their 
ratio has a certain probability distribution. Then the least significant bits of the 
quantity t = u — v in step B6 will be essentially random, except that t will 
be even. Hence t will be an odd multiple of 2^k with probability 2^(−k); this is 
the approximate probability that k right shifts will be needed in the subtract- 
and-shift cycle. In other words, we obtain a reasonable approximation to the 
behavior of Algorithm B if we assume that step B4 always branches to B3 with 
probability 1/2. 
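The assumption that k right shifts occur with probability about 2^(−k) is easy to test empirically. The sketch below draws random odd integers, forms |u − v|, and tabulates the number of trailing zero bits; the function name, bit length, trial count, and seed are arbitrary illustrative choices.

```python
import random

def shift_histogram(bits=64, trials=100_000, seed=1):
    """Empirical distribution of k, the number of right shifts needed to make
    |u - v| odd, for random odd u and v of the given bit length."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(trials):
        u = rng.getrandbits(bits) | 1      # force u odd
        v = rng.getrandbits(bits) | 1      # force v odd
        t = abs(u - v)
        if t == 0:
            continue
        k = (t & -t).bit_length() - 1      # number of trailing zero bits
        counts[k] = counts.get(k, 0) + 1
    total = sum(counts.values())
    return {k: c / total for k, c in sorted(counts.items())}

if __name__ == "__main__":
    for k, freq in list(shift_histogram().items())[:6]:
        print(k, round(freq, 4), 2.0 ** -k)
```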

Let G_n(x) be the probability that min(u, v)/max(u, v) is ≥ x after n 
subtract-and-shift cycles have been performed under this assumption. If u ≥ v and if 
exactly k right shifts are performed, the ratio X = v/u is changed to X’ = 
min(2^k v/(u − v), (u − v)/2^k v) = min(2^k X/(1 − X), (1 − X)/2^k X). Thus we will 
have X’ ≥ x if and only if 2^k X/(1 − X) ≥ x and (1 − X)/2^k X ≥ x; and this is 
the same as 

$$ \frac{1}{1 + 2^k/x} \le X \le \frac{1}{1 + 2^k x}. \tag{34} $$

Therefore G_n(x) satisfies the interesting recurrence 

$$ G_{n+1}(x) = \sum_{k\ge1} 2^{-k}\Bigl(G_n\Bigl(\frac{1}{1 + 2^k/x}\Bigr) - G_n\Bigl(\frac{1}{1 + 2^k x}\Bigr)\Bigr), \tag{35} $$

where G_0(x) = 1 − x for 0 ≤ x ≤ 1. Computational experiments indicate that 
G_n(x) converges rapidly to a limiting distribution G_∞(x) = G(x), although a 
formal proof of convergence seems to be difficult. We shall assume that G(x) 
exists; hence it satisfies 

$$ G(x) = \sum_{k\ge1} 2^{-k}\Bigl(G\Bigl(\frac{1}{1 + 2^k/x}\Bigr) - G\Bigl(\frac{1}{1 + 2^k x}\Bigr)\Bigr), \qquad \text{for } 0 < x \le 1; \tag{36} $$

$$ G(0) = 1; \qquad G(1) = 0. \tag{37} $$

Let 

$$ S(x) = \frac12 G\Bigl(\frac{1}{1+2x}\Bigr) + \frac14 G\Bigl(\frac{1}{1+4x}\Bigr) + \frac18 G\Bigl(\frac{1}{1+8x}\Bigr) + \cdots = \sum_{k\ge1} 2^{-k} G\Bigl(\frac{1}{1 + 2^k x}\Bigr); \tag{38} $$

then we have 

$$ G(x) = S(1/x) - S(x). \tag{39} $$

It is convenient to define 

$$ G(1/x) = -G(x), \tag{40} $$

so that (39) holds for all x > 0. As x runs from 0 to ∞, S(x) increases from 
0 to 1, hence G(x) decreases from +1 to −1. Of course G(x) is no longer a 
probability when x > 1; but it is meaningful nevertheless (see exercise 23). 
We will assume that there are power series α(x), β(x), γ_m(x), δ_m(x), λ(x), 
μ(x), σ_m(x), τ_m(x), and ρ(x) such that 

$$ G(x) = \alpha(x)\lg x + \beta(x) + \sum_{m=1}^{\infty}\bigl(\gamma_m(x)\cos 2\pi m\lg x + \delta_m(x)\sin 2\pi m\lg x\bigr), \tag{41} $$

$$ S(x) = \lambda(x)\lg x + \mu(x) + \sum_{m=1}^{\infty}\bigl(\sigma_m(x)\cos 2\pi m\lg x + \tau_m(x)\sin 2\pi m\lg x\bigr), \tag{42} $$

$$ \rho(x) = G(1 + x) = \rho_1 x + \rho_2 x^2 + \rho_3 x^3 + \rho_4 x^4 + \rho_5 x^5 + \rho_6 x^6 + \cdots, \tag{43} $$

because it can be shown that the solutions G_n(x) to (35) have this property for 
n ≥ 1. (See, for example, exercise 30.) The power series converge for |x| < 1. 
Fig. 10. The limiting distribution of ratios in the binary gcd algorithm. 


What can we deduce about α(x), ..., ρ(x) from equations (36)–(43)? In the 
first place we have 

$$ 2S(x) = G(1/(1 + 2x)) + S(2x) = S(2x) - \rho(2x) \tag{44} $$

from (38), (40), and (43). Consequently Eq. (42) holds if and only if 

$$ 2\lambda(x) = \lambda(2x); \tag{45} $$
$$ 2\mu(x) = \mu(2x) + \lambda(2x) - \rho(2x); \tag{46} $$
$$ 2\sigma_m(x) = \sigma_m(2x), \qquad 2\tau_m(x) = \tau_m(2x), \qquad \text{for } m \ge 1. \tag{47} $$

Relation (45) tells us that λ(x) is simply a constant multiple of x; we will write 

$$ \lambda(x) = -\lambda x \tag{48} $$

because the constant is negative. (The relevant coefficient turns out to be 

λ = 0.39792 26811 88316 64407 67071 61142 65498 23098+, (49) 

but no easy way to compute it is known.) Relation (46) tells us that ρ_1 = −λ, 
and that 2μ_k = 2^k μ_k − 2^k ρ_k when k > 1; in other words, 

$$ \mu_k = \rho_k/(1 - 2^{1-k}), \qquad \text{for } k \ge 2. \tag{50} $$

We also know from (47) that the two families of power series 

$$ \sigma_m(x) = \sigma_m x, \qquad \tau_m(x) = \tau_m x \tag{51} $$

are simply linear functions. (This is not true for γ_m(x) and δ_m(x).) 
Replacing x by 1/2x in (44) yields 

$$ 2S(1/2x) = S(1/x) + G(x/(1 + x)), \tag{52} $$

and (39) converts this equation to a relation between G and S when x is near 0: 

$$ 2G(2x) + 2S(2x) = G(x) + S(x) + G(x/(1 + x)). \tag{53} $$
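The coefficients of lg x must agree when both sides of this equation are expanded 
in power series, hence 

$$ 2\alpha(2x) - 4\lambda x = \alpha(x) - \lambda x + \alpha(x/(1 + x)). \tag{54} $$

Equation (54) is a recurrence that defines α(x). In fact, let us consider the 
function ψ(z) that satisfies 

$$ \psi(z) = \tfrac12\bigl(z + \psi(z/2) + \psi(z/(2 + z))\bigr), \qquad \psi(0) = 0. \tag{55} $$

Then (54) says that 

$$ \alpha(x) = \tfrac32\lambda\,\psi(x). \tag{56} $$

Moreover, iteration of (55) yields 

$$ \psi(z) = \frac z2\Bigl(\frac11 + \frac12\Bigl(\frac12 + \frac1{2+z}\Bigr) + \frac14\Bigl(\frac14 + \frac1{4+z} + \frac1{4+2z} + \frac1{4+3z}\Bigr) + \cdots\Bigr) = \frac z2\sum_{k\ge0}\frac1{2^k}\sum_{0\le j<2^k}\frac1{2^k + jz}. \tag{57} $$

It follows that the power series expansion of ψ(z) is 

$$ \psi(z) = \sum_{n\ge1}(-1)^{n-1}\psi_n z^n, \qquad \psi_n = \frac1{2n}\sum_{k=0}^{n-1}\binom nk\frac{B_k}{2^{k+1}-1} + \frac{\delta_{n1}}2; \tag{58} $$

see exercise 27. This formula for ψ_n is surprisingly similar to an expression that 
arises in connection with digital search tree algorithms, Eq. 6.3-(18). Exercise 28 
proves that ψ_n = O(n^(−2)). 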
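The Bernoulli-number formula for the coefficients ψ_n can be cross-checked against the functional equation for ψ. The sketch below assumes the forms ψ(z) = ½(z + ψ(z/2) + ψ(z/(2 + z))) for (55) and ψ_n = (1/2n) Σ_{0≤k<n} C(n,k) B_k/(2^(k+1) − 1) + δ_{n1}/2 for (58), with B_1 = −1/2; both are reconstructions consistent with (54) and (57), and all function names are introduced here for illustration.

```python
from fractions import Fraction
from math import comb

def bernoulli(m_max):
    """Bernoulli numbers B_0..B_m_max as exact fractions, with B_1 = -1/2."""
    B = [Fraction(1)]
    for m in range(1, m_max + 1):
        B.append(-sum(comb(m + 1, k) * B[k] for k in range(m)) / (m + 1))
    return B

def psi_coeffs(n_max):
    """[None, psi_1, ..., psi_n_max] computed from the assumed form of (58)."""
    B = bernoulli(n_max)
    coeffs = [None]
    for n in range(1, n_max + 1):
        s = sum(comb(n, k) * B[k] / (2 ** (k + 1) - 1) for k in range(n))
        coeffs.append(s / (2 * n) + (Fraction(1, 2) if n == 1 else 0))
    return coeffs

def psi(z, coeffs):
    """Truncated series sum_{n>=1} (-1)^(n-1) psi_n z^n, valid for |z| < 1."""
    return sum((-1) ** (n - 1) * float(c) * z ** n
               for n, c in enumerate(coeffs) if n >= 1)

if __name__ == "__main__":
    c = psi_coeffs(40)
    z = 0.25
    residual = psi(z, c) - 0.5 * (z + psi(z / 2, c) + psi(z / (2 + z), c))
    print(c[1], c[2], c[3], residual)
```

A residual near zero indicates that the series coefficients and the functional equation agree with each other; it does not by itself prove either formula.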

We now know α(x), except for the constant λ = −ρ_1, and (50) relates 
μ(x) to ρ(x) except for the coefficient μ_1. The answer to exercise 25 shows 
that the coefficients of ρ(x) can all be expressed in terms of ρ_1, ρ_3, ρ_5, ...; 
moreover, the constants σ_m and τ_m can be computed by the method used to 
solve exercise 29, and complicated relations also hold between the coefficients of 
the functions γ_m(x) and δ_m(x). However, there seems to be no way to compute 
all the coefficients of the various functions that enter into G(x) except to iterate 
the recurrence (36) by elaborate numerical methods. 

Once we have computed a good approximation to G(x), we can estimate the 
asymptotic average running time of Algorithm B as follows: If u ≥ v and if k 
right shifts are performed, the quantity Y = uv is changed to Y’ = (u − v)v/2^k; 
hence the ratio Y/Y’ is 2^k/(1 − X), where X = v/u is ≥ x with probability G(x). 
Therefore the number of bits in uv decreases on the average by the constant 

$$ b = \mathrm{E}\,\lg(Y/Y') = \sum_{k\ge1} 2^{-k}\Bigl(f_k(0) + \int_0^1 G(x) f_k'(x)\,dx\Bigr), $$

where f_k(x) = lg(2^k/(1 − x)); we have 

$$ b = \sum_{k\ge1} 2^{-k}\Bigl(k + \int_0^1 \frac{G(x)\,dx}{(1 - x)\ln 2}\Bigr) = 2 + \int_0^1 \frac{G(x)\,dx}{(1 - x)\ln 2}. \tag{59} $$

When eventually u = v, the expected value of lg uv will be approximately 0.9779 
(see exercise 14); therefore the total number of subtract-and-shift cycles of 
Algorithm B will be approximately 1/b times the initial value of lg uv. By symmetry, 
this is about 2/b times the initial value of lg u. Numerical computations carried 
out by Richard Brent in 1997 give the value 


2/b = 0.70597 12461 01916 39152 93141 35852 88176 66677+ (60) 


for this fundamental constant. 
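A rough Monte Carlo estimate of 2/b can be obtained by simulating the continuous model directly: pick k with probability 2^(−k), replace X by min(2^k X/(1 − X), (1 − X)/2^k X), and average lg(Y/Y’) = k − lg(1 − X) along the way. This is only a statistical sketch that assumes the long-run average along one simulated trajectory approximates the expectation under the limiting distribution; it is far less accurate than the numerical methods mentioned in the text, and the function name, step counts, and seed are arbitrary choices.

```python
import random
from math import log2

def estimate_two_over_b(steps=200_000, burn_in=1_000, seed=1):
    """Monte Carlo estimate of 2/b in the continuous model of Algorithm B."""
    rng = random.Random(seed)
    x = rng.random()
    total, count = 0.0, 0
    for i in range(steps + burn_in):
        if not 0.0 < x < 1.0:             # guard against degenerate floats
            x = rng.random()
            continue
        k = 1                             # k shifts with probability 2^-k
        while rng.random() < 0.5:
            k += 1
        if i >= burn_in:
            total += k - log2(1.0 - x)    # lg(Y/Y') = lg(2^k / (1 - X))
            count += 1
        a = 2.0 ** k * x / (1.0 - x)
        x = min(a, 1.0 / a)               # new ratio min/max
    return 2.0 / (total / count)

if __name__ == "__main__":
    print(estimate_two_over_b())          # compare with the constant in (60)
```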
A deeper study of these functions by Brigitte Vallée led her to suspect that 
the constants λ and b might be related by the remarkable formula 

$$ \frac{\lambda}{b} = \frac{2\ln 2}{\pi^2}. \tag{61} $$

Sure enough, the values computed by Brent agree perfectly with this tantalizing 
conjecture. Vallée has successfully analyzed Algorithm B using rigorous 
“dynamical” methods of great interest [see Algorithmica 22 (1998), 660-685]. 
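The agreement between the two computed constants and the conjectured relation λ/b = 2 ln 2/π² can be checked to double-precision accuracy. The sketch below simply copies the leading digits of λ from (49) and of 2/b from (60); the variable names are illustrative.

```python
from math import log, pi

LAMBDA = 0.3979226811883166       # leading digits of the constant in (49)
TWO_OVER_B = 0.7059712461019164   # leading digits of the constant in (60)

lhs = LAMBDA * TWO_OVER_B / 2     # lambda / b
rhs = 2 * log(2) / pi ** 2        # conjectured value 2 ln 2 / pi^2

if __name__ == "__main__":
    print(lhs, rhs, abs(lhs - rhs))
```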


Let us return to our assumption in (32) that u and v are odd and in the 
ranges 2^m < u < 2^(m+1) and 2^n < v < 2^(n+1). Empirical tests of Algorithm B with 
several million random inputs and with various values of m and n in the range 
29 ≤ m, n ≤ 37 indicate that the actual average behavior of the algorithm is 
given by 

$$ \begin{aligned} C &\approx \tfrac12 m + 0.203n + 1.9 - 0.4(0.6)^{m-n},\\ D &\approx m + 0.41n - 0.5 - 0.7(0.6)^{m-n}, \end{aligned} \qquad m \ge n, \tag{62} $$

with a rather small standard deviation from these observed average values. The 
coefficients 1/2 and 1 of m in (62) can be verified rigorously (see exercise 21). 

If we assume instead that u and v are to be any integers, independently and 
uniformly distributed over the ranges 

$$ 1 \le u < 2^N, \qquad 1 \le v < 2^N, \tag{63} $$

then we can calculate the average values of C and D from the data already given: 

$$ C \approx 0.70N + O(1), \qquad D \approx 1.41N + O(1). \tag{64} $$


(See exercise 22.) This agrees perfectly with the results of further empirical tests, 
made on several million random inputs for N < 30; the latter tests show that 
we may take 

C = 0.70N − 0.5, D = 1.41N − 2.7 (65) 


as decent estimates of the values, given this distribution of the inputs u and v. 
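Estimates of this kind are easy to reproduce on a small scale. The sketch below assumes the usual statement of Algorithm B (steps B1 to B6) given earlier in this section, with C counted as the number of executions of the subtraction step B6 and D as the number of halvings in step B3; the function names, the sample size, and the seed are choices made for this illustration.

```python
import random

def binary_gcd_counts(u, v):
    """Run Algorithm B on positive u, v; return (gcd, C, D)."""
    k = 0
    while u % 2 == 0 and v % 2 == 0:        # B1. Find power of 2.
        u //= 2; v //= 2; k += 1
    C = D = 0
    t = -v if u % 2 == 1 else u             # B2. Initialize.
    while True:
        while t % 2 == 0:                   # B3/B4. Halve t while it is even.
            t //= 2; D += 1
        if t > 0:                           # B5. Reset max(u, v).
            u = t
        else:
            v = -t
        t = u - v; C += 1                   # B6. Subtract.
        if t == 0:
            return u << k, C, D

def average_counts(N=30, trials=20_000, seed=1):
    """Average C and D over random u, v with 1 <= u, v < 2^N."""
    rng = random.Random(seed)
    c_sum = d_sum = 0
    for _ in range(trials):
        _, C, D = binary_gcd_counts(rng.randrange(1, 2 ** N),
                                    rng.randrange(1, 2 ** N))
        c_sum += C; d_sum += D
    return c_sum / trials, d_sum / trials

if __name__ == "__main__":
    C_avg, D_avg = average_counts()
    print(C_avg, 0.70 * 30 - 0.5)           # compare with the estimate for C
    print(D_avg, 1.41 * 30 - 2.7)           # compare with the estimate for D
```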

The theoretical analysis in Brent’s continuous model of Algorithm B predicts 
that C and D will be asymptotically equal to 2N/b and 4N/b under 
assumption (63), where 2/b ≈ 0.70597 is the constant in (60). The agreement with 
experiment is so good that Brent’s constant 2/b must be the true value of the 
number “0.70” in (65), and we should replace 0.203 by 0.206 in (62). 

This completes our study of the average values of C and D. The other three 
quantities that appear in the running time of Algorithm B are quite easy to 
analyze; see exercises 6, 7, and 8. 


Now that we know approximately how Algorithm B behaves on the average, 
let’s consider a “worst case” scenario: What values of u and v are in some sense 
the hardest to handle? If we assume as before that 

⌊lg u⌋ = m and ⌊lg v⌋ = n, 


we want to find u and v that make the algorithm run most slowly. The subtrac- 
tions take somewhat longer than the shifts, when the auxiliary bookkeeping is 
considered, so this question may be rephrased by asking for the inputs u and v 
that require the most subtractions. The answer is somewhat surprising; the 
maximum value of C is exactly 


max(m,n) + 1, (66) 


although a naive analysis would predict that substantially higher values of C 
are possible (see exercise 35). The derivation of the worst case (66) is quite 
interesting, so it has been left as an amusing problem for readers to work out for 
themselves (see exercises 36 and 37). 
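For small m and n the worst case can be confirmed by exhaustive search. The sketch below again assumes the usual steps B1 to B6 of Algorithm B, counts executions of the subtraction step B6, and checks only the range m ≥ n ≥ 1 covered by exercise 36; the helper names are hypothetical.

```python
def subtraction_steps(u, v):
    """Number of times step B6 of Algorithm B is executed for positive u, v."""
    while u % 2 == 0 and v % 2 == 0:        # B1
        u //= 2; v //= 2
    t = -v if u % 2 == 1 else u             # B2
    C = 0
    while True:
        while t % 2 == 0:                   # B3/B4
            t //= 2
        if t > 0:                           # B5
            u = t
        else:
            v = -t
        t = u - v; C += 1                   # B6
        if t == 0:
            return C

def max_steps(m, n):
    """Maximum of C over all u, v with floor(lg u) = m and floor(lg v) = n."""
    return max(subtraction_steps(u, v)
               for u in range(2 ** m, 2 ** (m + 1))
               for v in range(2 ** n, 2 ** (n + 1)))

if __name__ == "__main__":
    for m in range(1, 7):
        print(m, [max_steps(m, n) for n in range(1, m + 1)])
```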


EXERCISES 
1. [M21] How can (8), (9), (10), (11), and (12) be derived easily from (6) and (7)? 
2. [M22] Given that u divides v1v2...vn, prove that u divides 
gcd(u, v1) gcd(u, v2) ... gcd(u, vn). 
3. [M23] Show that the number of ordered pairs of positive integers (u, v) such that 
lcm(u, v) = n is the number of divisors of n². 


4. [M21] Given positive integers u and v, show that there are divisors u’ of u and 
v’ of v such that u’ ⊥ v’ and u’v’ = lcm(u, v). 


5. [M26] Invent an algorithm (analogous to Algorithm B) for calculating the greatest 
common divisor of two integers based on their balanced ternary representation. Dem- 
onstrate your algorithm by applying it to the calculation of gcd(40902, 24140). 


6. [M22] Given that u and v are random positive integers, find the mean and the 
standard deviation of the quantity A that enters into the timing of Program B. (This 
is the number of right shifts applied to both u and v during the preparatory phase.) 


7. [M20] Analyze the quantity B that enters into the timing of Program B. 

8. [M25] Show that in Program B, the average value of E is approximately equal to 
½C_ave, where C_ave is the average value of C. 

9. [18] Using Algorithm B and hand calculation, find gcd(31408, 2718). Also find 
integers m and n such that 31408m + 2718n = gcd(31408, 2718), using Algorithm X. 

> 10. [HM24] Let q_n be the number of ordered pairs of integers (u, v) lying in the range 
1 ≤ u, v ≤ n such that u ⊥ v. The object of this exercise is to prove that we have 
lim_{n→∞} q_n/n² = 6/π², thereby establishing Theorem D. 

a) Use the principle of inclusion and exclusion (Section 1.3.3) to show that 

$$ q_n = n^2 - \sum_{p_1}\lfloor n/p_1\rfloor^2 + \sum_{p_1<p_2}\lfloor n/p_1p_2\rfloor^2 - \cdots, $$

where the sums are taken over all prime numbers p_i. 
b) The Möbius function μ(n) is defined by the rules μ(1) = 1, μ(p1p2...pr) = (−1)^r 
if p1, p2, ..., pr are distinct primes, and μ(n) = 0 if n is divisible by the square of 
a prime. Show that q_n = Σ_{k≥1} μ(k)⌊n/k⌋². 
c) As a consequence of (b), prove that lim_{n→∞} q_n/n² = Σ_{k≥1} μ(k)/k². 
d) Prove that (Σ_{k≥1} μ(k)/k²)(Σ_{m≥1} 1/m²) = 1. Hint: When the series are 
absolutely convergent we have 

$$ \Bigl(\sum_{k\ge1} a_k/k^z\Bigr)\Bigl(\sum_{m\ge1} b_m/m^z\Bigr) = \sum_{n\ge1}\Bigl(\sum_{d\backslash n} a_d\,b_{n/d}\Bigr)\Big/n^z. $$



11. [M22] What is the probability that gcd(u, v) ≤ 3? (See Theorem D.) What is 
the average value of gcd(u, v)? 
12. [M24] (E. Cesàro.) If u and v are random positive integers, what is the 
average number of (positive) divisors they have in common? [Hint: See the identity in 
exercise 10(d), with a_k = b_m = 1.] 
13. [HM23] Given that u and v are random odd positive integers, show that they are 
relatively prime with probability 8/π². 

> 14. [HM25] What is the expected value of ln gcd(u, v) when u and v are (a) random 
positive integers? (b) random positive odd integers? 
15. [M21] What are the values of vı and v2 when Algorithm X terminates? 

> 16. [M22] Design an algorithm to divide u by v modulo m, given positive integers u, 
v, and m, with v relatively prime to m. In other words, your algorithm should find w, 
in the range 0 ≤ w < m, such that u ≡ vw (modulo m). 


> 17. [M20] Given two integers u and v such that uv ≡ 1 (modulo 2^e), explain how to 
compute an integer u’ such that u’v ≡ 1 (modulo 2^(2e)). [This leads to a fast algorithm 
for computing the reciprocal of an odd number modulo a power of 2, since we can start 
with a table of all such reciprocals for e = 8 or e = 16.] 


> 18. [M24] Show how Algorithm L can be extended (as Algorithm A was extended to 
Algorithm X) to obtain solutions of (15) when u and v are large. 


19. [21] Use the text’s method to find a general solution in integers to the following 
sets of equations: 
a) 3x + 7y + 11z = 1,        b) 3x + 7y + 11z = 1, 
   5x + 7y − 5z = 3;            5x + 7y − 5z = −3. 


20. [M37] Let u and v be odd integers, independently and uniformly distributed in 
the ranges 2^m < u < 2^(m+1), 2^n < v < 2^(n+1). What is the exact probability that a single 
subtract-and-shift cycle in Algorithm B reduces u and v to the ranges 2^m’ < u < 2^(m’+1), 
2^n’ < v < 2^(n’+1), as a function of m, n, m’, and n’? 
21. [HM26] Let C_mn and D_mn be the average number of subtraction steps and shift 
steps, respectively, in Algorithm B, when u and v are odd, ⌊lg u⌋ = m, ⌊lg v⌋ = n. 
Show that for fixed n, C_mn = ½m + O(1) and D_mn = m + O(1) as m → ∞. 
22. [M28] Continuing the previous exercise, show that if C_mn = αm + βn + γ for 
some constants α, β, and γ, then 

$$ \sum_{1\le n<m\le N} (N-m)(N-n)\,2^{m+n-2}\,C_{mn} = 2^{2N}\bigl(\tfrac{11}{27}(\alpha+\beta)N + O(1)\bigr), $$

$$ \sum_{1\le n\le N} (N-n)^2\,2^{2n-2}\,C_{nn} = 2^{2N}\bigl(\tfrac{5}{27}(\alpha+\beta)N + O(1)\bigr). $$



> 23. [M20] What is the probability that v/u ≤ x after n subtract-and-shift cycles of 
Algorithm B, when the algorithm begins with large random integers? (Here x is any 
real number ≥ 0; we do not assume that u ≥ v.) 


24. [M20] Suppose u > v in step B6, and assume that the ratio v/u has Brent’s 
limiting distribution G. What is the probability that u < v the next time step B6 is 
encountered? 


25. [M21] Equation (46) implies that ρ_1 = −λ; prove that ρ_2 = λ/2. 
26. [M22] Prove that when G(x) satisfies (36)—(40) we have 


2G(x) — 5G(2x) + 2G(4x) = G(1 + 2x) — 2G(1 + 4x) + 2G(1 + 1/x) — G(1 + 1/22). 


27. [M22] Prove (58), which expresses ψ_n in terms of Bernoulli numbers. 
28. [HM36] Study the asymptotic behavior of ψ_n. Hint: See exercise 6.3-34. 


> 29. [HM26] (R. P. Brent.) Find G_1(x), the distribution of min(u, v)/max(u, v) after 
the first subtract-and-shift cycle of Algorithm B as defined in (35). Hint: Let S_{n+1}(x) = 
Σ_{k≥1} 2^(−k) G_n(1/(1 + 2^k x)), and use the method of Mellin transforms for harmonic sums 
[see P. Flajolet, X. Gourdon, and P. Dumas, Theor. Comp. Sci. 144 (1995), 3-58]. 


30. [HM39] Continuing the previous exercise, determine G_2(x). 
31. [HM46] Prove or disprove Vallée’s conjecture (61). 
32. [HM42] Is there a unique continuous function G(x) that satisfies (36) and (37)? 


33. [M46] Analyze Harris’s “binary Euclidean algorithm,” stated after Program B. 


34. [HM49] Find a rigorous proof that Brent’s model describes the asymptotic 
behavior of Algorithm B. 
35. [M23] Consider a directed graph with vertices (m, n) for all nonnegative integers 
m, n ≥ 0, having arcs from (m, n) to (m’, n’) whenever it is possible for a 
subtract-and-shift cycle of Algorithm B to transform integers u and v with ⌊lg u⌋ = m and ⌊lg v⌋ = n 
into integers u’ and v’ with ⌊lg u’⌋ = m’ and ⌊lg v’⌋ = n’; there also is a special “Stop” 
vertex, with arcs from (n, n) to Stop for all n ≥ 0. What is the length of the longest 
path from (m, n) to Stop? (This gives an upper bound on the maximum running time 
of Algorithm B.) 

> 36. [M28] Given m ≥ n ≥ 1, find values of u and v with ⌊lg u⌋ = m and ⌊lg v⌋ = n 
such that Algorithm B requires m + 1 subtraction steps. 
37. [M32] Prove that the subtraction step B6 of Algorithm B is never executed more 
than 1 + ⌊lg max(u, v)⌋ times. 

> 38. [M32] (R. W. Gosper.) Demonstrate how to modify Algorithm B for large num- 
bers, using ideas analogous to those in Algorithm L. 
> 39. [M28] (V. R. Pratt.) Extend Algorithm B to an Algorithm Y that is analogous 
to Algorithm X. 


> 40. [M25] (R. P. Brent and H. T. Kung.) The following variant of the binary gcd 
algorithm is better than Algorithm B from the standpoint of hardware implementation, 
because it does not require testing the sign of u — v. Assume that u is odd; u and v 
can be either positive or negative. 


K1. [Initialize.] Set c ← 0. (This counter estimates the difference between lg |u| 
and lg |v|.) 

K2. [Done?] If v = 0, terminate with |u| as the answer. 

K3. [Make v odd.] Set v ← v/2 and c ← c + 1 zero or more times, until v is odd. 

K4. [Make c ≤ 0.] If c > 0, interchange u ↔ v and set c ← −c. 

K5. [Reduce.] Set w ← (u + v)/2. If w is even, set v ← w; otherwise set v ← w − v. 
Return to step K2. ∎ 


Prove that step K2 is performed at most 2 + 2 lg max(|u|, |v|) times. 


41. [M22] Use Euclid’s algorithm to find a simple formula for gcd(10^m − 1, 10^n − 1) 
when m and n are nonnegative integers. 


42. [M30] Evaluate the determinant 


| gcd(1,1)  gcd(1,2)  ...  gcd(1,n) | 
| gcd(2,1)  gcd(2,2)  ...  gcd(2,n) | 
|    ...       ...    ...     ...   | 
| gcd(n,1)  gcd(n,2)  ...  gcd(n,n) | 


*4.5.3. Analysis of Euclid’s Algorithm 


The execution time of Euclid’s algorithm depends on T, the number of times 
the division step A2 is performed. (See Algorithm 4.5.2A and Program 4.5.2A.) 
The quantity T is also an important factor in the running time of other algo- 
rithms, such as the evaluation of functions satisfying a reciprocity formula (see 
Section 3.3.3). We shall see in this section that the mathematical analysis of this 
quantity T is interesting and instructive. 


Relation to continued fractions. Euclid’s algorithm is intimately connected 
with continued fractions, which are expressions of the form 


$$ b_1/(a_1 + b_2/(a_2 + b_3/(\cdots/(a_{n-1} + b_n/a_n)\ldots))). \tag{1} $$

Continued fractions have a beautiful theory that is the subject of several classic 
books, such as O. Perron, Die Lehre von den Kettenbrüchen, 3rd edition (Stutt- 
gart: Teubner, 1954), 2 volumes; A. Khinchin, Continued Fractions, translated by 
Peter Wynn (Groningen: P. Noordhoff, 1963); and H. S. Wall, Analytic Theory 
of Continued Fractions (New York: Van Nostrand, 1948). See also Claude 
Brezinski, History of Continued Fractions and Padé Approximants (Springer, 
1991), for the early history of the subject. It is necessary to limit ourselves to 
a comparatively brief treatment of the theory here, studying only those aspects 
that give us more insight into the behavior of Euclid’s algorithm. 

The continued fractions of primary interest to us are those in which all of 
the b’s in (1) are equal to unity. For convenience in notation, let us define 


$$/\!/x_1, x_2, \ldots, x_n/\!/ = 1/(x_1 + 1/(x_2 + 1/(\cdots/(x_{n-1} + 1/x_n)\ldots))). \tag{2}$$


Thus, for example, 


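$$/\!/x_1/\!/ = \frac{1}{x_1}, \qquad /\!/x_1, x_2/\!/ = \frac{1}{x_1 + 1/x_2} = \frac{x_2}{x_1x_2 + 1}. \tag{3}$$
If $n = 0$, the symbol $/\!/x_1,\ldots,x_n/\!/$ is taken to mean 0. Let us also define the 
so-called continuant polynomials $K_n(x_1, x_2, \ldots, x_n)$ of $n$ variables, for $n \ge 0$, by 
the rule 
$$K_n(x_1, x_2, \ldots, x_n) = \begin{cases}
1, & \text{if } n = 0;\\
x_1, & \text{if } n = 1;\\
x_1K_{n-1}(x_2,\ldots,x_n) + K_{n-2}(x_3,\ldots,x_n), & \text{if } n > 1.
\end{cases} \tag{4}$$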


Thus $K_2(x_1, x_2) = x_1x_2 + 1$, $K_3(x_1, x_2, x_3) = x_1x_2x_3 + x_1 + x_3$, etc. In general, 
as noted by L. Euler in the eighteenth century, $K_n(x_1, x_2, \ldots, x_n)$ is the sum 
of all terms obtainable by starting with $x_1x_2\ldots x_n$ and deleting zero or more 
nonoverlapping pairs of consecutive variables $x_jx_{j+1}$; there are $F_{n+1}$ such terms. 
The basic property of continuants is the explicit formula 


$$/\!/x_1, x_2, \ldots, x_n/\!/ = K_{n-1}(x_2,\ldots,x_n)/K_n(x_1, x_2, \ldots, x_n), \qquad n \ge 1. \tag{5}$$
This can be proved by induction, since it implies that 
$$x_0 + /\!/x_1, \ldots, x_n/\!/ = K_{n+1}(x_0, x_1, \ldots, x_n)/K_n(x_1, \ldots, x_n);$$
hence $/\!/x_0, x_1, \ldots, x_n/\!/$ is the reciprocal of the latter quantity. 
The K-polynomials are symmetrical in the sense that 


$$K_n(x_1, x_2, \ldots, x_n) = K_n(x_n, \ldots, x_2, x_1). \tag{6}$$
This follows from Euler’s observation above, and as a consequence we have 
$$K_n(x_1, \ldots, x_n) = x_nK_{n-1}(x_1, \ldots, x_{n-1}) + K_{n-2}(x_1, \ldots, x_{n-2}) \tag{7}$$
for $n > 1$. The K-polynomials also satisfy the important identity 
$$K_n(x_1,\ldots,x_n)K_n(x_2,\ldots,x_{n+1}) - K_{n+1}(x_1,\ldots,x_{n+1})K_{n-1}(x_2,\ldots,x_n) = (-1)^n, \qquad n \ge 1. \tag{8}$$
(See exercise 4.) The latter equation in connection with (5) implies that 
$$/\!/x_1, \ldots, x_n/\!/ = \frac{1}{q_0q_1} - \frac{1}{q_1q_2} + \frac{1}{q_2q_3} - \cdots + \frac{(-1)^{n-1}}{q_{n-1}q_n}, \qquad\text{where } q_k = K_k(x_1, \ldots, x_k). \tag{9}$$
Thus the K-polynomials are intimately related to continued fractions. 
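
The definitions (2) and (4) and the identities (5), (6), and (8) are easy to check numerically. Here is a minimal sketch, assuming Python with exact rational arithmetic; the names `K` and `cf_value` are illustrative only, and the naive recursion is meant for small n.

```python
from fractions import Fraction


def K(*x):
    """Continuant K_n(x_1, ..., x_n), straight from recurrence (4)."""
    if len(x) == 0:
        return 1
    if len(x) == 1:
        return x[0]
    return x[0] * K(*x[1:]) + K(*x[2:])


def cf_value(*x):
    """//x_1, ..., x_n// evaluated from the innermost term outward, as in (2)."""
    value = Fraction(0)         # the empty continued fraction means 0
    for xk in reversed(x):
        value = 1 / (xk + value)
    return value


x = (3, 1, 1, 1, 2)
print(cf_value(*x), Fraction(K(*x[1:]), K(*x)))   # both sides of (5)
print(K(*x), K(*reversed(x)))                     # symmetry (6)
```

With all arguments equal to 1, the continuant counts Euler's terms, so `K(1, 1, 1)` is $F_4 = 3$.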
Every real number $X$ in the range $0 \le X < 1$ has a regular continued fraction 
defined as follows: Let $X_0 = X$, and for all $n \ge 0$ such that $X_n \ne 0$ let 

$$A_{n+1} = \lfloor 1/X_n \rfloor, \qquad X_{n+1} = 1/X_n - A_{n+1}. \tag{10}$$

If $X_n = 0$, the quantities $A_{n+1}$ and $X_{n+1}$ are not defined, and the regular 
continued fraction for $X$ is $/\!/A_1,\ldots,A_n/\!/$. If $X_n \ne 0$, this definition guarantees 
that $0 \le X_{n+1} < 1$, so each of the A’s is a positive integer. Definition (10) also 
implies that 

$$X = X_0 = \frac{1}{A_1 + X_1} = \frac{1}{A_1 + 1/(A_2 + X_2)} = \cdots;$$
hence 
$$X = /\!/A_1, \ldots, A_{n-1}, A_n + X_n/\!/ \tag{11}$$


for all $n \ge 1$, whenever $X_n$ is defined. In particular, we have $X = /\!/A_1,\ldots,A_n/\!/$ 
when $X_n = 0$. If $X_n \ne 0$, the number $X$ always lies between $/\!/A_1,\ldots,A_n/\!/$ 
and $/\!/A_1,\ldots,A_n + 1/\!/$, since by (7) the quantity $q_n = K_n(A_1,\ldots,A_n + X_n)$ 
increases monotonically from $K_n(A_1,\ldots,A_n)$ up to $K_n(A_1,\ldots,A_n + 1)$ as $X_n$ 
increases from 0 to 1, and by (9) the continued fraction increases or decreases 
when $q_n$ increases, according as $n$ is even or odd. In fact, 

$$\begin{aligned}
\bigl|X - /\!/A_1,\ldots,A_n/\!/\bigr|
 &= \bigl|/\!/A_1,\ldots,A_n + X_n/\!/ - /\!/A_1,\ldots,A_n/\!/\bigr|\\
 &= \bigl|/\!/A_1,\ldots,A_n, 1/X_n/\!/ - /\!/A_1,\ldots,A_n/\!/\bigr|\\
 &= \left|\frac{K_n(A_2,\ldots,A_n,1/X_n)}{K_{n+1}(A_1,\ldots,A_n,1/X_n)} - \frac{K_{n-1}(A_2,\ldots,A_n)}{K_n(A_1,\ldots,A_n)}\right|\\
 &= 1/\bigl(K_n(A_1,\ldots,A_n)\,K_{n+1}(A_1,\ldots,A_n,1/X_n)\bigr)\\
 &\le 1/\bigl(K_n(A_1,\ldots,A_n)\,K_{n+1}(A_1,\ldots,A_n,A_{n+1})\bigr)
\end{aligned} \tag{12}$$

by (5), (7), (8), and (10). Therefore $/\!/A_1,\ldots,A_n/\!/$ is an extremely close approx- 
imation to $X$, unless $n$ is small. If $X_n$ is nonzero for all $n$, we obtain an infinite 
continued fraction $/\!/A_1, A_2, A_3, \ldots/\!/$, whose value is defined to be 

$$\lim_{n\to\infty} /\!/A_1, A_2, \ldots, A_n/\!/;$$

from inequality (12) it is clear that this limit equals $X$. 

The regular continued fraction expansion of real numbers has several prop- 
erties analogous to the representation of numbers in the decimal system. If we 
use the formulas above to compute the regular continued fraction expansions of 
some familiar real numbers, we find, for example, that 


$$\begin{aligned}
\tfrac{8}{29} &= /\!/3,1,1,1,2/\!/;\\
\sqrt{8/29} &= /\!/1,1,9,2,2,3,2,2,9,1,2,1,9,2,2,3,2,2,9,1,2,1,9,2,2,3,2,2,9,1,\ldots/\!/;\\
\sqrt[3]{2} &= 1 + /\!/3,1,5,1,1,4,1,1,8,1,14,1,10,2,1,4,12,2,3,2,1,3,4,1,1,2,14,3,\ldots/\!/;\\
\pi &= 3 + /\!/7,15,1,292,1,1,1,2,1,3,1,14,2,1,1,2,2,2,2,1,84,2,1,1,15,3,13,\ldots/\!/;\\
e &= 2 + /\!/1,2,1,1,4,1,1,6,1,1,8,1,1,10,1,1,12,1,1,14,1,1,16,1,1,18,1,\ldots/\!/;\\
\gamma &= /\!/1,1,2,1,2,1,4,3,13,5,1,1,8,1,2,4,1,1,40,1,11,3,7,1,7,1,1,5,1,49,\ldots/\!/;\\
\phi &= 1 + /\!/1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,\ldots/\!/.
\end{aligned} \tag{13}$$

The numbers $A_1$, $A_2$, ... are called the partial quotients of $X$. Notice the regular 
pattern that appears in the partial quotients for $\sqrt{8/29}$, $\phi$, and $e$; the reasons for 
this behavior are discussed in exercises 12 and 16. There is no apparent pattern 
in the partial quotients for $\sqrt[3]{2}$, $\pi$, or $\gamma$. 

It is interesting to note that the ancient Greeks’ first definition of real 
numbers, once they had discovered the existence of irrationals, was essentially 
stated in terms of infinite continued fractions. (Later they adopted the suggestion 
of Eudoxus that x = y should be defined instead as “x < r if and only if y < r, 
for all rational r.”) See O. Becker, Quellen und Studien zur Geschichte Math., 
Astron., Physik B2 (1933), 311-333. 


When X is a rational number, the regular continued fraction corresponds 
in a natural way to Euclid’s algorithm. Let us assume that X = v/u, where 
$u > v \ge 0$. The regular continued fraction process starts with $X_0 = X$; let us 
define $U_0 = u$, $V_0 = v$. Assuming that $X_n = V_n/U_n \ne 0$, (10) becomes 

$$A_{n+1} = \lfloor U_n/V_n \rfloor, \qquad X_{n+1} = U_n/V_n - A_{n+1} = (U_n \bmod V_n)/V_n. \tag{14}$$
Therefore, if we define 
$$U_{n+1} = V_n, \qquad V_{n+1} = U_n \bmod V_n, \tag{15}$$

the condition $X_n = V_n/U_n$ holds throughout the process. Furthermore, (15) is 
precisely the transformation made on the variables u and v in Euclid’s algorithm 
(see Algorithm 4.5.2A, step A2). For example, since $\frac{8}{29} = /\!/3,1,1,1,2/\!/$, we 
know that Euclid’s algorithm applied to u = 29 and v = 8 will require exactly 
five division steps, and the quotients $\lfloor u/v \rfloor$ in step A2 will be successively 3, 1, 
1, 1, and 2. The last partial quotient $A_n$ must always be 2 or more when $X_n = 0$ 
and $n \ge 1$, since $X_{n-1}$ is less than unity. 

From this correspondence with Euclid’s algorithm we can see that the regular 
continued fraction for $X$ terminates at some step with $X_n = 0$ if and only if $X$ 
is rational; for it is obvious that $X_n$ cannot be zero if $X$ is irrational, and, 
conversely, we know that Euclid’s algorithm always terminates. If the partial 
quotients obtained during Euclid’s algorithm are $A_1$, $A_2$, ..., $A_n$, then we have, 
by (5), 

$$\frac{v}{u} = \frac{K_{n-1}(A_2, \ldots, A_n)}{K_n(A_1, A_2, \ldots, A_n)}. \tag{16}$$
This formula holds also if Euclid’s algorithm is applied for $u < v$, when $A_1 = 0$. 
Furthermore, because of relation (8), the continuants $K_{n-1}(A_2,\ldots,A_n)$ and 
$K_n(A_1, A_2, \ldots, A_n)$ are relatively prime, and the fraction on the right-hand side 
of (16) is in lowest terms; therefore 

$$u = K_n(A_1, A_2, \ldots, A_n)\,d, \qquad v = K_{n-1}(A_2, \ldots, A_n)\,d, \tag{17}$$
where $d = \gcd(u, v)$. 
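
The correspondence between Euclid's algorithm and relation (17) can be checked mechanically. The sketch below assumes Python; `euclid_quotients` and `continuant` are illustrative names, and the continuant is computed iteratively with recurrence (7) rather than (4).

```python
from math import gcd


def euclid_quotients(u, v):
    """Quotients floor(u/v) produced by the successive division steps."""
    quotients = []
    while v != 0:
        quotients.append(u // v)
        u, v = v, u % v         # transformation (15)
    return quotients


def continuant(xs):
    """K_n(x_1, ..., x_n), built left to right via recurrence (7)."""
    prev, cur = 0, 1            # a zero placeholder and K_0 = 1
    for x in xs:
        prev, cur = cur, x * cur + prev
    return cur


A = euclid_quotients(29, 8)
print(A, continuant(A), continuant(A[1:]))   # quotients, then u/d and v/d as in (17)
```

Doubling both inputs, to u = 58 and v = 16, leaves the quotients unchanged and only changes d.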
The worst case. We can now apply these observations to determine the 
behavior of Euclid’s algorithm in the worst case, or in other words to give an 
upper bound on the number of division steps. The worst case occurs when the 
inputs are consecutive Fibonacci numbers: 


Theorem F. For $n \ge 1$, let u and v be integers with $u > v > 0$ such that 
Euclid’s algorithm applied to u and v requires exactly n division steps, and such 
that u is as small as possible satisfying these conditions. Then $u = F_{n+2}$ and 
$v = F_{n+1}$. 


Proof. By (17), we must have $u = K_n(A_1, A_2, \ldots, A_n)\,d$, where $A_1$, $A_2$, ..., 
$A_n$, and $d$ are positive integers and $A_n \ge 2$. Since $K_n$ is a polynomial with 
nonnegative coefficients, involving all of the variables, the minimum value is 
achieved only when $A_1 = 1$, ..., $A_{n-1} = 1$, $A_n = 2$, $d = 1$. Putting these values 
in (17) yields the desired result. ∎ 
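
Theorem F is easy to confirm by brute force for small n. The following is only an illustrative Python sketch; `division_steps`, `smallest_input`, and the search bound `limit` are assumptions of this example.

```python
def division_steps(u, v):
    """Number of division steps Euclid's algorithm performs on (u, v)."""
    steps = 0
    while v != 0:
        u, v = v, u % v
        steps += 1
    return steps


def smallest_input(n, limit=200):
    """First pair u > v > 0 (smallest u) needing exactly n steps, or None."""
    for u in range(2, limit):
        for v in range(1, u):
            if division_steps(u, v) == n:
                return u, v
    return None


# Consecutive Fibonacci numbers F_3/F_2, F_4/F_3, ... should appear.
for n in range(1, 8):
    print(n, smallest_input(n))
```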


This theorem has the historical claim of being the first practical application 
of the Fibonacci sequence; since then many other applications of Fibonacci 
numbers to algorithms and to the study of algorithms have been discovered. The 
result is essentially due to T. F. de Lagny [Mém. Acad. Sci. 11 (Paris, 1733), 363- 
364], who tabulated the first several continuants and observed that Fibonacci 
numbers give the smallest numerator and denominator for continued fractions 
of a given length. He did not explicitly mention gcd calculation, however; the 
connection between Fibonacci numbers and Euclid’s algorithm was first pointed 
out by É. Léger [Correspondance Math. et Physique 9 (1837), 483-485.] 

Shortly afterwards, P. J. É. Finck [Traité Élémentaire d’Arithmétique (Stras- 
bourg: 1841), 44] proved by another method that gcd(u, v) takes at most $2\lg v + 1$ 
steps, when $u > v > 0$; and G. Lamé [Comptes Rendus Acad. Sci. 19 (Paris, 
1844), 867-870] improved this to $5\lceil \log_{10}(v + 1) \rceil$. Full details about these 
pioneering studies in the analysis of algorithms appear in an interesting review 
by J. O. Shallit, Historia Mathematica 21 (1994), 401-419. A more precise 
estimate of the worst case is, however, a direct consequence of Theorem F: 


Corollary L. If $0 \le v < N$, the number of division steps required when 
Algorithm 4.5.2A is applied to u and v is at most $\lfloor \log_\phi((3 - \phi)N) \rfloor$. 


Proof. After step A1 we have $v > u \bmod v$. Therefore by Theorem F, the 
maximum number of steps, n, occurs when $v = F_{n+1}$ and $u \bmod v = F_n$. Since 
$F_{n+1} < N$, we have $\phi^{n+1}/\sqrt5 < N$ (see Eq. 1.2.8-(15)); thus $\phi^n < (\sqrt5/\phi)N = 
(3 - \phi)N$. ∎ 

The quantity $\log_\phi((3 - \phi)N)$ is approximately equal to $2.078 \ln N + .6723 \approx 
4.785 \log_{10} N + .6723$. See exercises 31, 36, and 38 for extensions of Theorem F. 
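
As a sanity check on Corollary L, one can compare the bound with an exhaustive search for small N. This Python sketch uses floating-point logarithms, which is adequate for the small N tried here; the helper names are illustrative.

```python
import math

PHI = (1 + math.sqrt(5)) / 2


def division_steps(u, v):
    steps = 0
    while v != 0:
        u, v = v, u % v
        steps += 1
    return steps


def corollary_l_bound(N):
    """floor(log_phi((3 - phi) N)), evaluated in floating point."""
    return math.floor(math.log((3 - PHI) * N, PHI))


def max_steps_for_v(v):
    """Worst case over u for a fixed v; only u mod v matters."""
    if v == 0:
        return 0
    return max(division_steps(u, v) for u in range(v))


worst = 0
for N in range(1, 40):
    worst = max(worst, max_steps_for_v(N - 1))
    print(N, worst, corollary_l_bound(N))
```

In this range the bound is attained whenever N − 1 is a Fibonacci number $F_{n+1}$ with $n \ge 2$.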


An approximate model. Now that we know the maximum number of division 
steps that can occur, let us attempt to find the average number. Let T(m,n) 
be the number of division steps that occur when u = m and v = n are input to 
Euclid’s algorithm. Thus 


$$T(m, 0) = 0; \qquad T(m, n) = 1 + T(n, m \bmod n) \quad\text{if } n \ge 1. \tag{18}$$
Let $T_n$ be the average number of division steps when v = n and when u is chosen 
at random; since only the value of u mod v affects the algorithm after the first 
division step, we have 
$$T_n = \frac{1}{n} \sum_{0 \le k < n} T(k, n). \tag{19}$$
For example, T(0,5) = 1, T(1,5) = 2, T(2,5) = 3, T(3,5) = 4, T(4,5) = 3, so 
$$T_5 = \tfrac15(1 + 2 + 3 + 4 + 3) = 2\tfrac35.$$
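
Definitions (18) and (19) translate directly into code. A minimal Python sketch follows, with `T` and `average_T` as illustrative names; exact fractions avoid rounding in the average.

```python
from fractions import Fraction


def T(m, n):
    """Division steps for inputs u = m, v = n, following recurrence (18)."""
    steps = 0
    while n != 0:
        m, n = n, m % n
        steps += 1
    return steps


def average_T(n):
    """T_n of (19): the mean of T(k, n) over 0 <= k < n."""
    return Fraction(sum(T(k, n) for k in range(n)), n)


print([T(k, 5) for k in range(5)], average_T(5))
```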
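Our goal is to estimate $T_n$ for large n. One idea is to try an approximation 
suggested by R. W. Floyd: We might assume that, for $0 \le k < n$, the value of n 
is essentially “random” modulo k, so that we can set 

$$T_n \approx 1 + \frac{1}{n}\,(T_0 + T_1 + \cdots + T_{n-1}).$$
Then $T_n \approx S_n$, where the sequence $\langle S_n \rangle$ is the solution to the recurrence relation 
$$S_0 = 0, \qquad S_n = 1 + \frac{1}{n}\,(S_0 + S_1 + \cdots + S_{n-1}), \qquad n \ge 1. \tag{20}$$
This recurrence is easy to solve by noting that 
$$\begin{aligned}
S_{n+1} &= 1 + \frac{1}{n+1}\,(S_0 + S_1 + \cdots + S_{n-1} + S_n)\\
 &= 1 + \frac{1}{n+1}\,\bigl(n(S_n - 1) + S_n\bigr) = S_n + \frac{1}{n+1};
\end{aligned}$$

hence $S_n$ is $1 + \frac12 + \cdots + \frac1n = H_n$, a harmonic number. The approximation 
$T_n \approx S_n$ now suggests that we might have $T_n \approx \ln n + O(1)$. 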
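
The claim that recurrence (20) produces harmonic numbers can be confirmed exactly for small n. This is just a Python sketch with illustrative function names.

```python
from fractions import Fraction


def floyd_sequence(limit):
    """S_0, ..., S_limit computed exactly from recurrence (20)."""
    S = [Fraction(0)]
    running_sum = Fraction(0)           # S_0 + ... + S_{n-1}
    for n in range(1, limit + 1):
        s_n = 1 + running_sum / n
        S.append(s_n)
        running_sum += s_n
    return S


def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n."""
    return sum((Fraction(1, k) for k in range(1, n + 1)), Fraction(0))


S = floyd_sequence(10)
print(S[10], harmonic(10))
```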

Comparison of this approximation with tables of the true value of $T_n$ shows, 
however, that $\ln n$ is too large; $T_n$ does not grow this fast. Our tentative 
assumption that n is random modulo k must therefore be too pessimistic. And 
indeed, a closer look shows that the average value of n mod k is less than the 
average value of $\frac12 k$, in the range $1 \le k \le n$: 

$$\begin{aligned}
\frac{1}{n} \sum_{1 \le k \le n} (n \bmod k)
 &= \frac{1}{n} \sum_{1 \le k, q \le n} (n - qk)\,\bigl[\lfloor n/(q+1) \rfloor < k \le \lfloor n/q \rfloor\bigr]\\
 &= n - \frac{1}{n} \sum_{1 \le q \le n} q \left( \binom{\lfloor n/q \rfloor + 1}{2} - \binom{\lfloor n/(q+1) \rfloor + 1}{2} \right)\\
 &= \left(1 - \frac{\pi^2}{12}\right) n + O(\log n)
\end{aligned} \tag{21}$$

(see exercise 4.5.2-10(c)). This is only about .1775n, not .25n; so the value of 
n mod k tends to be smaller than Floyd’s model predicts, and Euclid’s algorithm 
works faster than we might expect. 
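
The constant in (21) is easy to observe empirically. Here is a short Python sketch; the name `mean_remainder` is illustrative, and the comparison is approximate because of the O(log n) term.

```python
import math


def mean_remainder(n):
    """(1/n) times the sum of n mod k over 1 <= k <= n."""
    return sum(n % k for k in range(1, n + 1)) / n


n = 10_000
print(mean_remainder(n) / n, 1 - math.pi ** 2 / 12)   # both near 0.1775
```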
A continuous model. The behavior of Euclid’s algorithm with v = N is 
essentially determined by the behavior of the regular continued fraction process 
when X = 0/N, 1/N, ..., (N−1)/N. When N is very large, we therefore want to 
study the behavior of regular continued fractions when X is essentially a random 
real number, uniformly distributed in [0..1). Consider the distribution function 

$$F_n(x) = \Pr(X_n \le x), \qquad\text{for } 0 \le x \le 1, \tag{22}$$

given a uniform distribution of $X = X_0$. By the definition of regular continued 
fractions, we have $F_0(x) = x$, and 

$$\begin{aligned}
F_{n+1}(x) &= \sum_{k \ge 1} \Pr(k \le 1/X_n \le k + x)\\
 &= \sum_{k \ge 1} \Pr\bigl(1/(k + x) \le X_n \le 1/k\bigr)\\
 &= \sum_{k \ge 1} \bigl(F_n(1/k) - F_n(1/(k + x))\bigr).
\end{aligned} \tag{23}$$
If the distributions $F_0(x)$, $F_1(x)$, ... defined by these formulas approach a 
limiting distribution $F_\infty(x) = F(x)$, we will have 
$$F(x) = \sum_{k \ge 1} \bigl(F(1/k) - F(1/(k + x))\bigr). \tag{24}$$

(An analogous relation, 4.5.2-(36), arose in our study of the binary gcd algo- 
rithm.) One function that satisfies (24) is $F(x) = \log_b(1 + x)$, for any base $b > 1$; 
see exercise 19. The further condition $F(1) = 1$ implies that we should take 
$b = 2$. Thus it is reasonable to make a guess that $F(x) = \lg(1 + x)$, and that 
$F_n(x)$ approaches this behavior. 
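
A quick numerical experiment makes the fixed-point property (24) plausible without proving it. The Python sketch below truncates the infinite sum after a fixed number of terms (an assumption of this example), so agreement is only approximate; `rhs_of_24` is an illustrative name.

```python
import math


def rhs_of_24(F, x, terms=100_000):
    """Partial sum of sum_{k>=1} (F(1/k) - F(1/(k+x))), the right side of (24)."""
    return sum(F(1 / k) - F(1 / (k + x)) for k in range(1, terms + 1))


def lg1p(x):
    return math.log2(1 + x)


for x in (0.25, 0.5, 1.0):
    print(x, lg1p(x), rhs_of_24(lg1p, x))
```

Applying the same sum to $F_0(x) = x$ instead does not reproduce $F_0$; it gives $F_1$, in accordance with (23).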

We might conjecture, for example, that $F(\frac12) = \lg(\frac32) \approx 0.58496$; let us see 
how close $F_n(\frac12)$ comes to this value for small n. We have $F_0(\frac12) = 0.50000$, and 

$$\begin{aligned}
F_1(x) &= \sum_{k \ge 1} \left(\frac{1}{k} - \frac{1}{k + x}\right) = H_x;\\
F_1(\tfrac12) &= H_{1/2} = 2 - 2\ln 2 \approx 0.61371;\\
F_2(\tfrac12) &= H_{2/2} - H_{2/3} + H_{2/4} - H_{2/5} + H_{2/6} - H_{2/7} + \cdots.
\end{aligned}$$
(See Table 3 of Appendix A.) The power series expansion 
$$H_x = \zeta(2)x - \zeta(3)x^2 + \zeta(4)x^3 - \zeta(5)x^4 + \cdots \tag{25}$$
makes it feasible to compute the numerical value 
$$F_2(\tfrac12) = 0.57655\ 93276\ 99914\ 08418\ 82618\ 72122\ 27055\ 92452-. \tag{26}$$

We’re getting closer to 0.58496; but it is not immediately clear how to get a good 
estimate of $F_n(\frac12)$ for n = 3, much less for really large values of n. 
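
Although exact evaluation of $F_n(\frac12)$ is awkward, a rough Monte Carlo estimate is straightforward. The following Python sketch samples $X_0$ uniformly and iterates rule (10) in floating point; the function name, sample size, and seed are assumptions of this illustration, and the results carry statistical error of roughly 0.001.

```python
import random


def estimate_F(n, x, samples=200_000, seed=1):
    """Monte Carlo estimate of F_n(x) = Pr(X_n <= x), X_0 uniform in [0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        X = rng.random()
        for _ in range(n):
            if X == 0.0:        # X_{k+1} is undefined; a probability-zero event
                break
            X = (1.0 / X) % 1.0  # X_{k+1} = 1/X_k - floor(1/X_k)
        if X <= x:
            hits += 1
    return hits / samples


for n in range(4):
    print(n, estimate_F(n, 0.5))
```

The estimates should hover near 0.500, 0.614, and 0.577 for n = 0, 1, 2, and drift toward lg(3/2) ≈ 0.585 after that.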
The distributions $F_n(x)$ were first studied by C. F. Gauss, who first thought 
of the problem on the 5th day of February in 1799. His notebook for 1800 
lists various recurrence relations and gives a brief table of values, including the 
(inaccurate) approximation $F_2(\frac12) \approx 0.5748$. After performing these calculations, 
Gauss wrote, “Tam complicatæ evadunt, ut nulla spes superesse videatur”; i.e., 
“They come out so complicated that no hope appears to be left.” Twelve years 
later, he wrote a letter to Laplace in which he posed the problem as one he could 
not resolve to his satisfaction. He said, “I found by very simple reasoning that, 
for n infinite, $F_n(x) = \log(1 + x)/\log 2$. But the efforts that I made since then 
in my inquiries to assign $F_n(x) - \log(1 + x)/\log 2$ for very large but not infinite 
values of n were fruitless.” He never published his “very simple reasoning,” 
and it is not completely clear that he had found a rigorous proof. [See Gauss’s 
Werke, vol. $10_1$, 552-556.] More than 100 years went by before a proof was finally 
published, by R. O. Kuz’min [Atti del Congresso Internazionale dei Matematici 
6 (Bologna, 1928), 83-89], who showed that 

$$F_n(x) = \lg(1 + x) + O\bigl(e^{-A\sqrt{n}}\bigr)$$

for some positive constant A. The error term was improved to $O(e^{-An})$ by Paul 
Lévy shortly afterwards [Bull. Soc. Math. de France 57 (1929), 178-194]*; but 
Gauss’s problem, namely to find the asymptotic behavior of $F_n(x) - \lg(1 + x)$, 
was not really resolved until 1974, when Eduard Wirsing published a beautiful 
analysis of the situation [Acta Arithmetica 24 (1974), 507-528]. We shall study 
the simplest aspects of Wirsing’s approach here, since his method is an instructive 
use of linear operators. 

If $G$ is any function of $x$ defined for $0 \le x \le 1$, let $SG$ be the function 
defined by 
$$SG(x) = \sum_{k \ge 1} \left( G\Bigl(\frac{1}{k}\Bigr) - G\Bigl(\frac{1}{k + x}\Bigr) \right). \tag{27}$$

Thus, $S$ is an operator that changes one function into another. In particular, 
by (23) we have $F_{n+1}(x) = SF_n(x)$, hence 

$$F_n = S^nF_0. \tag{28}$$

(In this discussion $F_n$ stands for a distribution function, not for a Fibonacci 
number.) Notice that $S$ is a “linear operator”; that is, $S(cG) = c(SG)$ for all 
constants $c$, and $S(G_1 + G_2) = SG_1 + SG_2$. 

Now if $G$ has a bounded first derivative, we can differentiate (27) term by 
term to show that 

$$(SG)'(x) = \sum_{k \ge 1} \frac{1}{(k + x)^2}\, G'\Bigl(\frac{1}{k + x}\Bigr); \tag{29}$$

hence $SG$ also has a bounded first derivative. (Term-by-term differentiation 
of a convergent series is justified when the series of derivatives is uniformly 
convergent; see, for example, K. Knopp, Theory and Application of Infinite 
Series (Glasgow: Blackie, 1951), §47.) 

* An exposition of Lévy’s interesting proof appeared in the first edition of this book. 

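Let $H = SG$, and let $g(x) = (1 + x)G'(x)$, $h(x) = (1 + x)H'(x)$. It follows 
that 
$$\begin{aligned}
h(x) &= \sum_{k \ge 1} \frac{1 + x}{(k + x)^2} \left(1 + \frac{1}{k + x}\right)^{-1} g\Bigl(\frac{1}{k + x}\Bigr)\\
 &= \sum_{k \ge 1} \left(\frac{k}{k + 1 + x} - \frac{k - 1}{k + x}\right) g\Bigl(\frac{1}{k + x}\Bigr).
\end{aligned}$$

In other words, $h = Tg$, where $T$ is the linear operator defined by 

$$Tg(x) = \sum_{k \ge 1} \left(\frac{k}{k + 1 + x} - \frac{k - 1}{k + x}\right) g\Bigl(\frac{1}{k + x}\Bigr). \tag{30}$$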


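Continuing, we see that if $g$ has a bounded first derivative, we can differen- 
tiate term by term to show that $Tg$ does also: 

$$\begin{aligned}
(Tg)'(x) &= -\sum_{k \ge 1} \left( \left(\frac{k}{(k + 1 + x)^2} - \frac{k - 1}{(k + x)^2}\right) g\Bigl(\frac{1}{k + x}\Bigr) + \left(\frac{k}{k + 1 + x} - \frac{k - 1}{k + x}\right) \frac{1}{(k + x)^2}\, g'\Bigl(\frac{1}{k + x}\Bigr) \right)\\
 &= -\sum_{k \ge 1} \left( \frac{k}{(k + 1 + x)^2} \left( g\Bigl(\frac{1}{k + x}\Bigr) - g\Bigl(\frac{1}{k + 1 + x}\Bigr) \right) + \frac{1 + x}{(k + x)^3 (k + 1 + x)}\, g'\Bigl(\frac{1}{k + x}\Bigr) \right).
\end{aligned}$$

There is consequently a third linear operator, $U$, such that $(Tg)' = -U(g')$, 
namely 
$$U\varphi(x) = \sum_{k \ge 1} \left( \frac{k}{(k + 1 + x)^2} \int_{1/(k + 1 + x)}^{1/(k + x)} \varphi(t)\,dt + \frac{1 + x}{(k + x)^3 (k + 1 + x)}\, \varphi\Bigl(\frac{1}{k + x}\Bigr) \right). \tag{31}$$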
What is the relevance of all this to our problem? Well, if we set 
$$F_n(x) = \lg(1 + x) + R_n\bigl(\lg(1 + x)\bigr), \tag{32}$$
$$f_n(x) = (1 + x)F_n'(x) = \frac{1}{\ln 2} \Bigl(1 + R_n'\bigl(\lg(1 + x)\bigr)\Bigr), \tag{33}$$
we have 
$$f_n'(x) = R_n''\bigl(\lg(1 + x)\bigr) \big/ \bigl((\ln 2)^2 (1 + x)\bigr); \tag{34}$$

the effect of the $\lg(1 + x)$ term disappears, after these transformations. Further- 
more, since $F_n = S^nF_0$, we have $f_n = T^nf_0$ and $f_n' = (-1)^nU^nf_0'$. Both $F_n$ 
and $f_n$ have bounded derivatives, by induction on n. Thus (34) becomes 

$$(-1)^n R_n''\bigl(\lg(1 + x)\bigr) = (1 + x)(\ln 2)^2\, U^nf_0'(x). \tag{35}$$
Now $F_0(x) = x$, $f_0(x) = 1 + x$, and $f_0'(x)$ is the constant function 1. We are 
going to show that the operator $U^n$ takes the constant function into a function 
with very small values, hence $|R_n''(x)|$ must be very small for $0 \le x \le 1$. Finally 
we can clinch the argument by showing that $R_n(x)$ itself is small: Since we have 
$R_n(0) = R_n(1) = 0$, it follows from a well-known interpolation formula (see 
exercise 4.6.4-15 with $x_0 = 0$, $x_1 = x$, $x_2 = 1$) that 

$$R_n(x) = -\frac{x(1 - x)}{2}\, R_n''\bigl(\xi_n(x)\bigr) \tag{36}$$

for some function $\xi_n(x)$, where $0 \le \xi_n(x) \le 1$ when $0 \le x \le 1$. 

Thus everything hinges on our being able to prove that $U^n$ produces small 
function values, where $U$ is the linear operator defined in (31). Notice that $U$ is 
a positive operator, in the sense that $U\varphi(x) \ge 0$ for all x if $\varphi(x) \ge 0$ for all x. 
It follows that $U$ is order-preserving: If $\varphi_1(x) \le \varphi_2(x)$ for all x then we have 
$U\varphi_1(x) \le U\varphi_2(x)$ for all x. 

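One way to exploit this property is to find a function $\varphi$ for which we can 
calculate $U\varphi$ exactly, and to use constant multiples of this function to bound the 
ones that we are really interested in. First let us look for a function $g$ such that 
$Tg$ is easy to compute. If we consider functions defined for all $x \ge 0$, instead of 
only on [0..1], it is easy to remove the summation from (27) by observing that 

$$SG(x + 1) - SG(x) = G\Bigl(\frac{1}{1 + x}\Bigr) - \lim_{k \to \infty} G\Bigl(\frac{1}{k + x}\Bigr) = G\Bigl(\frac{1}{1 + x}\Bigr) - G(0) \tag{37}$$

when $G$ is continuous. Since $T((1 + x)G') = (1 + x)(SG)'$, it follows (see 
exercise 20) that 
$$\frac{Tg(x)}{1 + x} - \frac{Tg(1 + x)}{2 + x} = \left(\frac{1}{1 + x} - \frac{1}{2 + x}\right) g\Bigl(\frac{1}{1 + x}\Bigr). \tag{38}$$
If we set $Tg(x) = 1/(1 + x)$, we find that the corresponding value of $g(x)$ is 
$1 + x - 1/(1 + x)$. Let $\varphi(x) = g'(x) = 1 + 1/(1 + x)^2$, so that $U\varphi(x) = -(Tg)'(x) = 
1/(1 + x)^2$; this is the function $\varphi$ we have been looking for. 
For this choice of $\varphi$ we have $2 \le \varphi(x)/U\varphi(x) = (1 + x)^2 + 1 \le 5$ for $0 \le x \le 1$, 
hence 

$$\tfrac15\varphi \le U\varphi \le \tfrac12\varphi.$$
By the positivity of $U$ and $\varphi$ we can apply $U$ to this inequality again, obtaining 
$\frac{1}{25}\varphi \le \frac15U\varphi \le U^2\varphi \le \frac12U\varphi \le \frac14\varphi$; and after $n - 1$ applications we have 

$$5^{-n}\varphi \le U^n\varphi \le 2^{-n}\varphi \tag{39}$$

for this particular $\varphi$. Let $\chi(x) = f_0'(x) = 1$ be the constant function; then for 
$0 \le x \le 1$ we have $\frac54\chi \le \varphi \le 2\chi$, hence 

$$\tfrac58 5^{-n}\chi \le \tfrac12 5^{-n}\varphi \le \tfrac12 U^n\varphi \le U^n\chi \le \tfrac45 U^n\varphi \le \tfrac45 2^{-n}\varphi \le \tfrac85 2^{-n}\chi.$$
It follows by (35) that 
$$\tfrac58 (\ln 2)^2\, 5^{-n} \le (-1)^n R_n''(x) \le \tfrac{16}{5} (\ln 2)^2\, 2^{-n}, \qquad\text{for } 0 \le x \le 1;$$

hence by (32) and (36) we have proved the following result: 
Theorem W. The distribution $F_n(x)$ equals $\lg(1 + x) + O(2^{-n})$ as $n \to \infty$. 
In fact, $F_n(x) - \lg(1 + x)$ lies between $\frac{5}{16}(-1)^{n+1} 5^{-n} \bigl(\ln(1 + x)\bigr)\bigl(\ln 2/(1 + x)\bigr)$ 
and $\frac85(-1)^{n+1} 2^{-n} \bigl(\ln(1 + x)\bigr)\bigl(\ln 2/(1 + x)\bigr)$, for $0 \le x \le 1$, where $\ln 2/(1 + x)$ denotes $\ln\bigl(2/(1 + x)\bigr)$. ∎ 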


With a slightly different choice of $\varphi$, we can obtain tighter bounds (see 
exercise 21). In fact, Wirsing went much further in his paper, proving that 

$$F_n(x) = \lg(1 + x) + (-\lambda)^n\,\Psi(x) + O\bigl(x(1 - x)(\lambda - 0.031)^n\bigr), \tag{40}$$

where 
$$\begin{aligned}
\lambda &= 0.30366\ 30028\ 98732\ 65859\ 74481\ 21901\ 55623\ 31109-\\
 &= /\!/3, 3, 2, 2, 3, 13, 1, 174, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1, \ldots/\!/
\end{aligned} \tag{41}$$

is a fundamental constant (apparently unrelated to more familiar constants), and 
where $\Psi$ is an interesting function that is analytic in the entire complex plane 
except for the negative real axis from $-1$ to $-\infty$. Wirsing’s function satisfies 
$\Psi(0) = \Psi(1) = 0$, $\Psi'(0) < 0$, and $S\Psi = -\lambda\Psi$; thus by (37) it satisfies the identity 

$$\Psi(z) - \Psi(z + 1) = \frac{1}{\lambda}\,\Psi\Bigl(\frac{1}{1 + z}\Bigr). \tag{42}$$

Furthermore, Wirsing demonstrated that 
$$\Psi\Bigl(-\frac{u}{v} + \frac{i}{N}\Bigr) = c\,\lambda^{-n} \log N + O(1) \qquad\text{as } N \to \infty, \tag{43}$$

where $c$ is a constant and $n = T(u, v)$ is the number of iterations when Euclid’s 
algorithm is applied to the integers $u > v > 0$. 

A complete solution to Gauss’s problem was found a few years later by K. I. 
Babenko [Doklady Akad. Nauk SSSR 238 (1978), 1021-1024], who used powerful 
techniques of functional analysis to prove that 

$$F_n(x) = \lg(1 + x) + \sum_{j \ge 2} \lambda_j^n\, \Psi_j(x) \tag{44}$$

for all $0 \le x \le 1$, $n \ge 1$. Here $|\lambda_2| > |\lambda_3| \ge |\lambda_4| \ge \cdots$, and each $\Psi_j(z)$ 
is an analytic function in the complex plane except for a cut at $[-\infty\,..\,-1]$. 
The function $\Psi_2$ is Wirsing’s $\Psi$, and $\lambda_2 = -\lambda$, while $\lambda_3 \approx 0.10088$, $\lambda_4 \approx 
-0.03550$, $\lambda_5 \approx 0.01284$, $\lambda_6 \approx -0.00472$, $\lambda_7 \approx 0.00175$. Babenko also es- 
tablished further properties of the eigenvalues $\lambda_j$, proving in particular that 
they are exponentially small as $j \to \infty$, and that the sum for $j \ge k$ in (44) is 
bounded by $(\pi^2/6)\,|\lambda_k|^{n-1} \min(x, 1 - x)$. [Further information appears in papers 
by Babenko and Yuriev, Doklady Akad. Nauk SSSR 240 (1978), 1273-1276; 
Mayer and Roepstorff, J. Statistical Physics 47 (1987), 149-171; 50 (1988), 331- 
344; D. Hensley, J. Number Theory 49 (1994), 142-182; Daudé, Flajolet, and 
Vallée, Combinatorics, Probability and Computing 6 (1997), 397-433; Flajolet 
and Vallée, Theoretical Comp. Sci. 194 (1998), 1-34.] The 40-place value of $\lambda$ 
in (41) was computed by John Hershberger. 

From continuous to discrete. We have now derived results about the prob- 
ability distributions for continued fractions when X is a real number uniformly 
distributed in the interval [0..1). But a real number is rational with probability 
zero (almost all numbers are irrational), so these results do not apply directly 
to Euclid’s algorithm. Before we can apply Theorem W to our problem, some 
technicalities must be overcome. Consider the following observation based on 
elementary measure theory: 


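Lemma M. Let $I_1$, $I_2$, ..., $J_1$, $J_2$, ... be pairwise disjoint intervals contained 
in the interval [0..1), and let 

$$\mathcal{I} = \bigcup_{k \ge 1} I_k, \qquad \mathcal{J} = \bigcup_{k \ge 1} J_k, \qquad \mathcal{K} = [0\,..\,1] \setminus (\mathcal{I} \cup \mathcal{J}).$$

Assume that $\mathcal{K}$ has measure zero. Let $P_n$ be the set $\{0/n, 1/n, \ldots, (n - 1)/n\}$. 
Then 
$$\lim_{n \to \infty} \frac{|\mathcal{I} \cap P_n|}{n} = \mu(\mathcal{I}). \tag{45}$$

Here $\mu(\mathcal{I})$ is the Lebesgue measure of $\mathcal{I}$, namely, $\sum_{k \ge 1} \mathrm{length}(I_k)$; and $|\mathcal{I} \cap P_n|$ 
denotes the number of elements in the set $\mathcal{I} \cap P_n$. 
Proof. Let $\mathcal{I}_N = \bigcup_{1 \le k \le N} I_k$ and $\mathcal{J}_N = \bigcup_{1 \le k \le N} J_k$. Given $\epsilon > 0$, find N large 
enough so that $\mu(\mathcal{I}_N) + \mu(\mathcal{J}_N) \ge 1 - \epsilon$, and let 
$$\mathcal{K}_N = \mathcal{K} \cup \bigcup_{k > N} I_k \cup \bigcup_{k > N} J_k.$$
If $I$ is an interval, having any of the forms (a..b) or [a..b) or (a..b] or [a..b], 
it is clear that $\mu(I) = b - a$ and 
$$n\mu(I) - 1 \le |I \cap P_n| \le n\mu(I) + 1.$$
Now let $r_n = |\mathcal{I}_N \cap P_n|$, $s_n = |\mathcal{J}_N \cap P_n|$, $t_n = |\mathcal{K}_N \cap P_n|$; we have 
$$\begin{gathered}
r_n + s_n + t_n = n;\\
n\mu(\mathcal{I}_N) - N \le r_n \le n\mu(\mathcal{I}_N) + N;\\
n\mu(\mathcal{J}_N) - N \le s_n \le n\mu(\mathcal{J}_N) + N.
\end{gathered}$$
Furthermore $r_n \le |\mathcal{I} \cap P_n| \le r_n + t_n$, because $\mathcal{I}_N \subseteq \mathcal{I} \subseteq \mathcal{I}_N \cup \mathcal{K}_N$. Consequently 

$$\begin{aligned}
\mu(\mathcal{I}) - \frac{N}{n} - \epsilon \le \mu(\mathcal{I}_N) - \frac{N}{n} &\le \frac{r_n}{n} \le \frac{r_n + t_n}{n}\\
 &= 1 - \frac{s_n}{n} \le 1 - \mu(\mathcal{J}_N) + \frac{N}{n} \le \mu(\mathcal{I}) + \frac{N}{n} + \epsilon.
\end{aligned}$$

Given $\epsilon$, this holds for all n; so $\lim_{n \to \infty} r_n/n = \lim_{n \to \infty} (r_n + t_n)/n = \mu(\mathcal{I})$. ∎ 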


Exercise 25 shows that Lemma M is not trivial, in the sense that some rather 
restrictive hypotheses are needed to prove (45). 


Distribution of partial quotients. Now we put Theorem W and Lemma M 
together to derive some solid facts about Euclid’s algorithm. 

Theorem E. Let $p_k(a, n)$ be the probability that the $(k + 1)$st quotient $A_{k+1}$ 
in Euclid’s algorithm is equal to a, when u = n and when v is equally likely to 
be any of the numbers $\{0, 1, \ldots, n - 1\}$. Then 

$$\lim_{n \to \infty} p_k(a, n) = F_k\Bigl(\frac{1}{a}\Bigr) - F_k\Bigl(\frac{1}{a + 1}\Bigr),$$

where $F_k(x)$ is the distribution function (22). 


Proof. The set $\mathcal{I}$ of all X in [0..1) for which $A_{k+1} = a$ is a union of disjoint 
intervals, and so is the set $\mathcal{J}$ of all X for which $A_{k+1} \ne a$. Lemma M therefore 
applies, with $\mathcal{K}$ the set of all X for which $A_{k+1}$ is undefined. Furthermore, 
$F_k(1/a) - F_k(1/(a + 1))$ is the probability that $1/(a + 1) < X_k \le 1/a$, which is 
$\mu(\mathcal{I})$, the probability that $A_{k+1} = a$. ∎ 


As a consequence of Theorems E and W, we can say that a quotient equal 
to a occurs with the approximate probability 

$$\lg(1 + 1/a) - \lg\bigl(1 + 1/(a + 1)\bigr) = \lg\bigl((a + 1)^2/((a + 1)^2 - 1)\bigr).$$

Thus 

a quotient of 1 occurs about $\lg(\frac43) \approx 41.504$ percent of the time; 
a quotient of 2 occurs about $\lg(\frac98) \approx 16.993$ percent of the time; 
a quotient of 3 occurs about $\lg(\frac{16}{15}) \approx 9.311$ percent of the time; 
a quotient of 4 occurs about $\lg(\frac{25}{24}) \approx 5.889$ percent of the time. 

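
These limiting probabilities are simple to tabulate. A small Python sketch, with `quotient_probability` as an illustrative name:

```python
import math


def quotient_probability(a):
    """lg((a+1)^2 / ((a+1)^2 - 1)), the limiting probability of a quotient a."""
    return math.log2((a + 1) ** 2 / ((a + 1) ** 2 - 1))


for a in range(1, 5):
    print(a, f"{100 * quotient_probability(a):.3f} percent")
```

Because the formula is a difference lg(1 + 1/a) − lg(1 + 1/(a + 1)), the probabilities for a = 1, 2, 3, ... telescope to a total of lg 2 = 1.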


Actually, if Euclid’s algorithm produces the quotients $A_1$, $A_2$, ..., $A_t$, the 
nature of the proofs above will guarantee this behavior only for $A_k$ when k is 
comparatively small with respect to t; the values $A_{t-1}$, $A_{t-2}$, ... are not covered 
by this proof. But we can in fact show that the distribution of the last quotients 
$A_{t-1}$, $A_{t-2}$, ... is essentially the same as the first. 

For example, consider the regular continued fraction expansions for the set 
of all proper fractions whose denominator is 29: 


 1/29 = //29//         8/29 = //3,1,1,1,2//    15/29 = //1,1,14//          22/29 = //1,3,7// 
 2/29 = //14,2//       9/29 = //3,4,2//        16/29 = //1,1,4,3//         23/29 = //1,3,1,5// 
 3/29 = //9,1,2//     10/29 = //2,1,9//        17/29 = //1,1,2,2,2//       24/29 = //1,4,1,4// 
 4/29 = //7,4//       11/29 = //2,1,1,1,3//    18/29 = //1,1,1,1,1,3//     25/29 = //1,6,4// 
 5/29 = //5,1,4//     12/29 = //2,2,2,2//      19/29 = //1,1,1,9//         26/29 = //1,8,1,2// 
 6/29 = //4,1,5//     13/29 = //2,4,3//        20/29 = //1,2,4,2//         27/29 = //1,13,2// 
 7/29 = //4,7//       14/29 = //2,14//         21/29 = //1,2,1,1,1,2//     28/29 = //1,28// 


Several things can be observed in this table. 


a) As mentioned earlier, the last quotient is always 2 or more. Furthermore, 
we have the obvious identity 

$$/\!/x_1, \ldots, x_{n-1}, x_n + 1/\!/ = /\!/x_1, \ldots, x_{n-1}, x_n, 1/\!/, \tag{46}$$
which shows how continued fractions whose last quotient is unity are related to 
regular continued fractions. 


b) The values in the right-hand columns have a simple relationship to the values 
in the left-hand columns; can the reader see the correspondence before reading 
any further? The relevant identity is 


$$1 - /\!/x_1, x_2, \ldots, x_n/\!/ = /\!/1, x_1 - 1, x_2, \ldots, x_n/\!/; \tag{47}$$


see exercise 9. 


c) There is symmetry between left and right in the first two columns: If 
$/\!/A_1, A_2, \ldots, A_t/\!/$ occurs, so does $/\!/A_t, \ldots, A_2, A_1/\!/$. This will always be the 
case (see exercise 26). 


d) If we examine all of the quotients in the table, we find that there are 96 in 
all, of which 39/96 ≈ 40.6 percent are equal to 1, 21/96 ≈ 21.9 percent are equal to 2, 
and 8/96 ≈ 8.3 percent are equal to 3; this agrees reasonably well with the probabilities 
listed above. 
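
The census in observation (d) can be reproduced by running Euclid's algorithm on every proper fraction with denominator 29. The following Python sketch uses illustrative names (`partial_quotients`, `quotient_census`).

```python
from collections import Counter


def partial_quotients(v, u):
    """Partial quotients of v/u, 0 < v < u, read off from Euclid's algorithm."""
    quotients = []
    while v != 0:
        quotients.append(u // v)
        u, v = v, u % v
    return quotients


def quotient_census(denominator):
    """Tally of all partial quotients over the fractions k/denominator, 0 < k < denominator."""
    counts = Counter()
    for numerator in range(1, denominator):
        counts.update(partial_quotients(numerator, denominator))
    return counts


census = quotient_census(29)
print(sum(census.values()), census[1], census[2], census[3])
```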


The number of division steps. Let us now return to our original problem and 
investigate $T_n$, the average number of division steps when v = n. (See Eq. (19).) 
Here are some sample values of $T_n$: 


   n =    95    96    97    98    99   100   101   102   103   104   105 
 T_n =   5.0   4.4   5.3   4.8   4.7   4.6   5.3   4.6   5.3   4.7   4.6 


   n =   996   997   998   999  1000  1001  ...   9999  10000  10001 
 T_n =   6.5   7.3   7.0   6.8   6.4   6.7  ...    8.6    8.3    9.1 

   n = 49998 49999 50000 50001  ...  99999 100000 100001 
 T_n =   9.8  10.6   9.7  10.0  ...   10.7   10.3   11.0 


Notice the somewhat erratic behavior; $T_n$ tends to be larger than its neighbors 
when n is prime, and it is correspondingly lower when n has many divisors. (In 
this list, 97, 101, 103, 997, and 49999 are primes; 10001 = 73 · 137; 49998 = 
2 · 3 · 13 · 641; 50001 = 3 · 7 · 2381; 99999 = 3 · 3 · 41 · 271; and 100001 = 11 · 9091.) 
It is not difficult to understand why this happens: If gcd(u, v) = d, Euclid’s 
algorithm applied to u and v behaves essentially the same as if it were applied to 
u/d and v/d. Therefore, when v = n has several divisors, there are many choices 
of u for which n behaves as if it were smaller. 

Accordingly let us consider another quantity, $\tau_n$, which is the average num- 
ber of division steps when v = n and when u is relatively prime to n. Thus 

$$\tau_n = \frac{1}{\varphi(n)} \sum_{\substack{0 \le m < n\\ \gcd(m,n) = 1}} T(m, n). \tag{48}$$
It follows that 
$$T_n = \frac{1}{n} \sum_{d \backslash n} \varphi(d)\, \tau_d. \tag{49}$$
Here is a table of $\tau_n$ for the same values of n considered above: 


   n =    95    96    97    98    99   100   101   102   103   104   105 
 τ_n =   5.4   5.3   5.3   5.6   5.2   5.2   5.4   5.3   5.4   5.3   5.6 


   n =   996   997   998   999  1000  1001  ...   9999  10000  10001 
 τ_n =   7.2   7.3   7.3   7.3   7.3   7.4  ...   9.21   9.21   9.22 

   n = 49998 49999 50000 50001  ...  99999 100000 100001 
 τ_n = 10.59 10.58 10.57 10.59  ... 11.170 11.172 11.172 
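
Relation (49) between $T_n$ and $\tau_n$ is an exact identity, so it can be tested with rational arithmetic. The Python sketch below uses illustrative names and a deliberately naive totient count.

```python
from fractions import Fraction
from math import gcd


def T(m, n):
    """Division steps for u = m, v = n, per recurrence (18)."""
    steps = 0
    while n != 0:
        m, n = n, m % n
        steps += 1
    return steps


def average_T(n):
    """T_n of (19)."""
    return Fraction(sum(T(k, n) for k in range(n)), n)


def tau(n):
    """tau_n of (48): mean of T(m, n) over 0 <= m < n with gcd(m, n) = 1."""
    coprime = [m for m in range(n) if gcd(m, n) == 1]
    return Fraction(sum(T(m, n) for m in coprime), len(coprime))


def rhs_of_49(n):
    """(1/n) * sum over divisors d of n of phi(d) * tau_d."""
    total = Fraction(0)
    for d in range(1, n + 1):
        if n % d == 0:
            phi_d = sum(1 for m in range(d) if gcd(m, d) == 1)
            total += phi_d * tau(d)
    return total / n


for n in (12, 29, 100):
    print(n, float(average_T(n)), float(tau(n)), average_T(n) == rhs_of_49(n))
```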


Clearly $\tau_n$ is much more well-behaved than $T_n$, and it should be more susceptible 
to analysis. Inspection of a table of $\tau_n$ for small n reveals some curious anomalies; 
for example, $\tau_{50} = \tau_{100}$ and $\tau_{60} = \tau_{120}$. But as n grows, the values of $\tau_n$ 
behave quite regularly indeed, as the table indicates, and they show no significant 
relation to the factorization properties of n. If these values $\tau_n$ are plotted as 
functions of $\ln n$ on graph paper, for the values of $\tau_n$ given above, they lie very 
nearly on the straight line 

$$\tau_n \approx 0.843 \ln n + 1.47. \tag{50}$$


We can account for this behavior if we study the regular continued fraction 
process a little further. In Euclid’s algorithm as expressed in (15) we have 
$$\frac{V_0}{U_0}\,\frac{V_1}{U_1} \cdots \frac{V_{t-1}}{U_{t-1}} = \frac{V_{t-1}}{U_0},$$

since $U_{k+1} = V_k$; therefore if $U = U_0$ and $V = V_0$ are relatively prime, and if 
there are t division steps, we have 

$$X_0X_1 \ldots X_{t-1} = 1/U.$$
Setting $U = N$ and $V = m < N$, we find that 
$$\ln X_0 + \ln X_1 + \cdots + \ln X_{t-1} = -\ln N. \tag{51}$$

We know the approximate distribution of $X_0$, $X_1$, $X_2$, ..., so we can use this 
equation to estimate 

$$t = T(N, m) = T(m, N) - 1.$$


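Returning to the formulas preceding Theorem W, we find that the average 
value of $\ln X_n$, when $X_0$ is a real number uniformly distributed in [0..1), is 

$$\int_0^1 \ln x\, F_n'(x)\, dx = \int_0^1 \ln x\, f_n(x)\, dx/(1 + x), \tag{52}$$
where $f_n(x)$ is defined in (33). Now 

$$f_n(x) = \frac{1}{\ln 2} + O(2^{-n}), \tag{53}$$
using the facts we have derived earlier (see exercise 23); hence the average value 
of $\ln X_n$ is very well approximated by 

$$\begin{aligned}
\frac{1}{\ln 2} \int_0^1 \frac{\ln x}{1 + x}\, dx
 &= -\frac{1}{\ln 2} \int_0^\infty \frac{u e^{-u}}{1 + e^{-u}}\, du\\
 &= -\frac{1}{\ln 2} \sum_{k \ge 1} (-1)^{k+1} \int_0^\infty u e^{-ku}\, du\\
 &= -\frac{1}{\ln 2} \left(1 - \frac14 + \frac19 - \frac{1}{16} + \frac{1}{25} - \cdots\right)\\
 &= -\frac{1}{\ln 2} \left(1 + \frac14 + \frac19 + \cdots - 2\left(\frac14 + \frac{1}{16} + \frac{1}{36} + \cdots\right)\right)\\
 &= -\frac{1}{2 \ln 2} \left(1 + \frac14 + \frac19 + \cdots\right)\\
 &= -\pi^2/(12 \ln 2).
\end{aligned}$$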

By (51) we therefore expect to have the approximate formula 

$$-t\pi^2/(12 \ln 2) \approx -\ln N;$$

that is, t should be approximately equal to $((12 \ln 2)/\pi^2) \ln N$. This constant 
$(12 \ln 2)/\pi^2 = 0.842765913\ldots$ agrees perfectly with the empirical formula (50) 
obtained earlier, so we have good reason to believe that the formula 

$$\tau_n \approx \frac{12 \ln 2}{\pi^2} \ln n + 1.47 \tag{54}$$

indicates the true asymptotic behavior of $\tau_n$ as $n \to \infty$. 
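If we assume that (54) is valid, we obtain the formula 

$$T_n \approx \frac{12 \ln 2}{\pi^2} \left(\ln n - \sum_{d \backslash n} \frac{\Lambda(d)}{d}\right) + 1.47, \tag{55}$$

where $\Lambda(d)$ is von Mangoldt’s function defined by the rules 
$$\Lambda(n) = \begin{cases}
\ln p, & \text{if } n = p^r \text{ for } p \text{ prime and } r \ge 1;\\
0, & \text{otherwise.}
\end{cases} \tag{56}$$

(See exercise 27.) For example, 

$$\begin{aligned}
T_{100} &\approx \frac{12 \ln 2}{\pi^2} \left(\ln 100 - \frac{\ln 2}{2} - \frac{\ln 2}{4} - \frac{\ln 5}{5} - \frac{\ln 5}{25}\right) + 1.47\\
 &\approx (0.843)(4.605 - 0.347 - 0.173 - 0.322 - 0.064) + 1.47\\
 &\approx 4.59;
\end{aligned}$$

the exact value of $T_{100}$ is 4.56. 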
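
The worked example for n = 100 can be reproduced as follows. This Python sketch takes the additive constant 1.47 from the empirical fit (50); the function names are illustrative, and `T_estimate` is a heuristic, not an exact formula.

```python
import math
from fractions import Fraction

C = 12 * math.log(2) / math.pi ** 2      # 0.842765913...


def von_mangoldt(d):
    """Lambda(d): ln p if d is a power p^r of a prime (r >= 1), else 0."""
    if d < 2:
        return 0.0
    p = next(q for q in range(2, d + 1) if d % q == 0)   # smallest prime factor
    while d % p == 0:
        d //= p
    return math.log(p) if d == 1 else 0.0


def T_estimate(n):
    """Heuristic formula (55)."""
    correction = sum(von_mangoldt(d) / d for d in range(1, n + 1) if n % d == 0)
    return C * (math.log(n) - correction) + 1.47


def T_exact(n):
    """T_n of (19), computed by running Euclid's algorithm on every k < n."""
    total = 0
    for k in range(n):
        u, v = k, n
        while v != 0:
            u, v = v, u % v
            total += 1
    return Fraction(total, n)


print(round(T_estimate(100), 2), float(T_exact(100)))
```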
We can also estimate the average number of division steps when $u$ and $v$ are
both uniformly distributed between 1 and $N$, by calculating

$$\frac{1}{N}\sum_{m=1}^{N}\frac{1}{N}\sum_{n=1}^{N} T(m,n) = \frac{2}{N^2}\sum_{n=1}^{N} nT_n - \frac12 - \frac{1}{2N}. \tag{57}$$


Assuming formula (55), exercise 29 shows that this sum has the form

$$\frac{12\ln 2}{\pi^2}\ln N + O(1), \tag{58}$$

and empirical calculations with the same numbers used to derive Eq. 4.5.2-(65)
show good agreement with the formula

$$\frac{12\ln 2}{\pi^2}\ln N + 0.06. \tag{59}$$
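The average over all pairs can be computed directly for modest N. The sketch below is an illustration, not the calculation behind Eq. 4.5.2-(65): the function names are hypothetical, and the rearrangement in terms of the sums n T_n is derived here from the assumed conventions T(m, 0) = 0 and T(m, n) = 1 + T(n, m mod n), which give T(m, n) = T(n, m) - 1 whenever n < m.

```python
from math import log, pi

def steps(u: int, v: int) -> int:
    """T(u, v): number of division steps Euclid's algorithm performs."""
    count = 0
    while v != 0:
        u, v = v, u % v
        count += 1
    return count

def grid_average(N: int) -> float:
    """Average of T(m, n) over all pairs 1 <= m, n <= N."""
    total = sum(steps(m, n) for m in range(1, N + 1) for n in range(1, N + 1))
    return total / N ** 2

def via_column_averages(N: int) -> float:
    """Same average, rewritten as (2/N^2) * sum(n*T_n) - 1/2 - 1/(2N)."""
    weighted = sum(steps(m, n) for n in range(1, N + 1) for m in range(n))
    return 2 * weighted / N ** 2 - 0.5 - 0.5 / N

N = 200
# Direct average, rearranged average, and the empirical formula (59).
print(grid_average(N), via_column_averages(N),
      (12 * log(2) / pi ** 2) * log(N) + 0.06)
```

The first two numbers agree up to rounding; the third is only an asymptotic approximation, so some discrepancy at small N is to be expected.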


Of course we have not yet proved anything about $T_n$ and $\tau_n$ in general; so far
we have only been considering plausible reasons why certain formulas ought to 
hold. Fortunately it is now possible to supply rigorous proofs, based on a careful 
analysis by several mathematicians. 

The leading coefficient $12\pi^{-2}\ln 2$ in the formulas above was established first,
in independent studies by Gustav Lochs, John D. Dixon, and Hans A. Heilbronn.
Lochs [Monatshefte für Math. 65 (1961), 27-52] derived a formula equivalent to
the fact that (57) equals $(12\pi^{-2}\ln 2)\ln N + a + O(N^{-1/2})$, where $a \approx 0.065$.
Unfortunately his paper remained essentially unknown for many years, perhaps
because it computed only an average value from which we cannot derive definite
information about $T_n$ for any particular $n$. Dixon [J. Number Theory 2 (1970),
414-422] developed the theory of the $F_n(x)$ distributions to show that individual
partial quotients are essentially independent of each other in an appropriate
sense, and proved that for all positive $\epsilon$ we have
$|T(m,n) - (12\pi^{-2}\ln 2)\ln n| < (\ln n)^{(1/2)+\epsilon}$
except for $\exp(-c(\epsilon)(\log N)^{\epsilon/2})N^2$ values of $m$ and $n$ in the range
$1 \le m < n \le N$, where $c(\epsilon) > 0$. Heilbronn's approach was completely different,
working entirely with integers instead of continuous variables. His idea, which is
presented in slightly modified form in exercises 33 and 34, is based on the fact
that $\tau_n$ can be related to the number of ways to represent $n$ in a certain manner.
Furthermore, his paper [Number Theory and Analysis, edited by Paul Turán
(New York: Plenum, 1969), 87-96] shows that the distribution of individual
partial quotients 1, 2, ... that we have discussed above actually applies to the
entire collection of partial quotients belonging to the fractions having a given
denominator; this is a sharper form of Theorem E. A still sharper result was
obtained several years later by J. W. Porter [Mathematika 22 (1975), 20-28],
who established that

$$\tau_n = \frac{12\ln 2}{\pi^2}\ln n + C + O(n^{-1/6+\epsilon}), \tag{60}$$

where $C \approx 1.46707\,80794$ is the constant

$$\frac{6\ln 2}{\pi^2}\Bigl(3\ln 2 + 4\gamma - \frac{24}{\pi^2}\zeta'(2) - 2\Bigr) - \frac12; \tag{61}$$

see D. E. Knuth, Computers and Math. with Applic. 2 (1976), 137-139. Thus
the conjecture (50) is fully proved. Using (60), Graham H. Norton [J. Symbolic
Computation 10 (1990), 53-58] extended the calculations of exercise 29 to
confirm Lochs's work, proving that the empirical constant 0.06 in (59) is actually

$$\frac{6\ln 2}{\pi^2}\Bigl(3\ln 2 + 4\gamma - \frac{12}{\pi^2}\zeta'(2) - 3\Bigr) - 1 = 0.06535\,14259\ldots. \tag{62}$$


D. Hensley proved in J. Number Theory 49 (1994), 142-182, that the variance
of $T_n$ is proportional to $\log n$.

The average running time for Euclid's algorithm on multiple-precision integers,
using classical algorithms for arithmetic, was shown to be of order

$$\bigl(1 + \log(\max(u,v)/\gcd(u,v))\bigr)\log\min(u,v) \tag{63}$$

by G. E. Collins, in SICOMP 3 (1974), 1-10.


Summary. We have found that the worst case of Euclid's algorithm occurs
when its inputs $u$ and $v$ are consecutive Fibonacci numbers (Theorem F); the
number of division steps when $0 \le v < N$ will never exceed $\lceil 4.8\log_{10} N - 0.32\rceil$.
We have determined the frequency of the values of various partial quotients,
showing, for example, that the division step finds $\lfloor u/v\rfloor = 1$ about 41 percent of
the time (Theorem E). And, finally, the theorems of Heilbronn and Porter prove
that the average number $T_n$ of division steps when $v = n$ is approximately

$$((12\ln 2)/\pi^2)\ln n \approx 1.9405\log_{10} n,$$

minus a correction term based on the divisors of $n$ as shown in Eq. (55).
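The worst-case statements in this summary are easy to check by brute force for small N. The Python sketch below is illustrative only; the helper names are hypothetical, and the step count again assumes that one division step is counted for each evaluation of u mod v.

```python
from math import ceil, log10

def steps(u: int, v: int) -> int:
    """Number of division steps Euclid's algorithm performs on (u, v)."""
    count = 0
    while v != 0:
        u, v = v, u % v
        count += 1
    return count

def fib(k: int) -> int:
    """Fibonacci number F_k with F_0 = 0, F_1 = 1."""
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a

def worst_case(N: int) -> int:
    """Largest step count over 0 <= v < N; u may be reduced mod v first,
    since steps(u, v) == steps(u % v, v) for v > 0."""
    return max((steps(u, v) for v in range(1, N) for u in range(v)), default=0)

def bound(N: int) -> int:
    """The bound ceil(4.8 log10 N - 0.32) quoted in the summary."""
    return ceil(4.8 * log10(N) - 0.32)

for N in (10, 100, 1000):
    print(N, worst_case(N), bound(N))

# Consecutive Fibonacci numbers F_{n+2}, F_{n+1} need exactly n steps.
print([steps(fib(n + 2), fib(n + 1)) for n in range(1, 11)])
```

The brute-force search is quadratic in N, so it is meant only for small ranges.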


EXERCISES 


1. [20] Since the quotient $\lfloor u/v\rfloor$ is equal to unity more than 40 percent of the time
in Algorithm 4.5.2A, it may be advantageous on some computers to make a test for
this case and to avoid the division when the quotient is unity. Is the following MIX
program for Euclid's algorithm more efficient than Program 4.5.2A?

        LDX   U      rX ← u.
        JMP   2F
    1H  STX   V      v ← rX.
        SUB   V      rA ← u - v.
        CMPA  V
        SRAX  5      rAX ← rA.
        JL    2F     Is u - v < v?
        DIV   V      rX ← rAX mod v.
    2H  LDA   V      rA ← v.
        JXNZ  1B     Done if rX = 0. ▮

2. [M21] Evaluate the matrix product

$$\begin{pmatrix} x_1 & 1\\ 1 & 0\end{pmatrix}\begin{pmatrix} x_2 & 1\\ 1 & 0\end{pmatrix}\cdots\begin{pmatrix} x_n & 1\\ 1 & 0\end{pmatrix}.$$

3. [M21] What is the value of

$$\det\begin{pmatrix} x_1 & 1 & 0 & \ldots & 0\\ -1 & x_2 & 1 & & 0\\ 0 & -1 & x_3 & 1 & \vdots\\ \vdots & & -1 & \ddots & 1\\ 0 & 0 & \ldots & -1 & x_n\end{pmatrix}?$$


4. [M20] Prove Eq. (8).

5. [HM25] Let $x_1$, $x_2$, ... be a sequence of real numbers that are each greater than
some positive real number $\epsilon$. Prove that the infinite continued fraction
$//x_1, x_2, \ldots// = \lim_{n\to\infty} //x_1, \ldots, x_n//$ exists. Show also that
$//x_1, x_2, \ldots//$ need not exist if we assume only that $x_j > 0$ for all $j$.

6. [M23] Prove that the regular continued fraction expansion of a number is unique
in the following sense: If $B_1$, $B_2$, ... are positive integers, then the infinite continued
fraction $//B_1, B_2, \ldots//$ is an irrational number $X$ between 0 and 1 whose regular
continued fraction has $A_n = B_n$ for all $n \ge 1$; and if $B_1$, ..., $B_m$ are positive integers
with $B_m > 1$, then the regular continued fraction for $X = //B_1, \ldots, B_m//$ has $A_n = B_n$
for $1 \le n \le m$.

7. [M26] Find all permutations $p(1)p(2)\ldots p(n)$ of the integers $\{1, 2, \ldots, n\}$ such
that $K_n(x_1, x_2, \ldots, x_n) = K_n(x_{p(1)}, x_{p(2)}, \ldots, x_{p(n)})$ is an identity for all $x_1$, $x_2$, ..., $x_n$.

8. [M20] Show that $-1/X_n = //A_n, \ldots, A_1, -X//$, whenever $X_n$ is defined, in the
regular continued fraction process.

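9. [M21] Show that continued fractions satisfy the following identities:

a) $//x_1, \ldots, x_n// = //x_1, \ldots, x_k + //x_{k+1}, \ldots, x_n// //$, $\quad 1 \le k \le n$;

b) $//0, x_1, x_2, \ldots, x_n// = x_1 + //x_2, \ldots, x_n//$, $\quad n \ge 1$;

c) $//x_1, \ldots, x_{k-1}, x_k, 0, x_{k+1}, x_{k+2}, \ldots, x_n// = //x_1, \ldots, x_{k-1}, x_k + x_{k+1}, x_{k+2}, \ldots, x_n//$,
$\quad 1 \le k < n$;

d) $1 - //x_1, x_2, \ldots, x_n// = //1, x_1 - 1, x_2, \ldots, x_n//$, $\quad n \ge 1$.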

10. [M28] By the result of exercise 6, every irrational real number $X$ has a unique
regular continued fraction representation of the form

$$X = A_0 + //A_1, A_2, A_3, \ldots//,$$

where $A_0$ is an integer and $A_1$, $A_2$, $A_3$, ... are positive integers. Show that if $X$ has
this representation then the regular continued fraction for $1/X$ is

$$1/X = B_0 + //B_1, \ldots, B_m, A_5, A_6, \ldots//$$

for suitable integers $B_0$, $B_1$, ..., $B_m$. (The case $A_0 < 0$ is, of course, the most
interesting.) Explain how to determine the $B$'s in terms of $A_0$, $A_1$, $A_2$, $A_3$, and $A_4$.

11. [M30] (J.-A. Serret, 1850.) Let $X = A_0 + //A_1, A_2, A_3, A_4, \ldots//$ and
$Y = B_0 + //B_1, B_2, B_3, B_4, \ldots//$ be the regular continued fraction representations of two real
numbers $X$ and $Y$, in the sense of exercise 10. Show that these representations
"eventually agree," in the sense that $A_{m+k} = B_{n+k}$ for some $m$ and $n$ and for all
$k \ge 0$, if and only if we have $X = (qY + r)/(sY + t)$ for some integers $q$, $r$, $s$, $t$ with
$|qt - rs| = 1$. (This theorem is the analog, for continued fraction representations, of
the simple result that the representations of $X$ and $Y$ in the decimal system eventually
agree if and only if $X = (10^q Y + r)/10^s$ for some integers $q$, $r$, and $s$.)


12. [M30] A quadratic irrationality is a number of the form $(\sqrt{D} - U)/V$, where $D$,
$U$, and $V$ are integers, $D > 0$, $V \ne 0$, and $D$ is not a perfect square. We may assume
without loss of generality that $V$ is a divisor of $D - U^2$, for otherwise the number may
be rewritten as $(\sqrt{DV^2} - U|V|)/(V|V|)$.

a) Prove that the regular continued fraction expansion (in the sense of exercise 10) of
a quadratic irrationality $X = (\sqrt{D} - U)/V$ is obtained by the following formulas:

$$V_0 = V, \qquad A_0 = \lfloor X\rfloor, \qquad U_0 = U + A_0 V;$$
$$V_{n+1} = (D - U_n^2)/V_n, \qquad A_{n+1} = \lfloor(\sqrt{D} + U_n)/V_{n+1}\rfloor, \qquad U_{n+1} = A_{n+1}V_{n+1} - U_n.$$



b) Prove that $0 < U_n < \sqrt{D}$, $0 < V_n < 2\sqrt{D}$, for all $n > N$, where $N$ is some
integer depending on $X$; hence the regular continued fraction representation of
every quadratic irrationality is eventually periodic. [Hint: Show that

$$(-\sqrt{D} - U)/V = A_0 + //A_1, \ldots, A_n, -V_n/(\sqrt{D} + U_n)//,$$

and use Eq. (5) to prove that $(\sqrt{D} + U_n)/V_n$ is positive when $n$ is large.]

c) Letting $p_n = K_{n+1}(A_0, A_1, \ldots, A_n)$ and $q_n = K_n(A_1, \ldots, A_n)$, prove the identity
$Vp_n^2 + 2Up_nq_n + ((U^2 - D)/V)q_n^2 = (-1)^{n+1}V_{n+1}$.

d) Prove that the regular continued fraction representation of an irrational number X 
is eventually periodic if and only if X is a quadratic irrationality. (This is the 
continued fraction analog of the fact that the decimal expansion of a real number X 
is eventually periodic if and only if X is rational.) 

13. [M40] (J. Lagrange, 1767.) Let $f(x) = a_nx^n + \cdots + a_0$, $a_n > 0$, be a polynomial
having exactly one real root $\xi > 1$, where $\xi$ is irrational and $f'(\xi) \ne 0$. Experiment
with a computer program to find the first thousand or so partial quotients of $\xi$, using
the following algorithm (which essentially involves only addition):

L1. Set $A \leftarrow 1$.

L2. For $k = 0$, 1, ..., $n - 1$ (in this order) and for $j = n - 1$, ..., $k$ (in this order),
set $a_j \leftarrow a_{j+1} + a_j$. (This step replaces $f(x)$ by $g(x) = f(x+1)$, a polynomial
whose roots are one less than those of $f$.)

L3. If $a_n + a_{n-1} + \cdots + a_0 < 0$, set $A \leftarrow A + 1$ and return to L2.

L4. Output $A$ (which is the value of the next partial quotient). Replace the coefficients
$(a_n, a_{n-1}, \ldots, a_0)$ by $(-a_0, -a_1, \ldots, -a_n)$ and return to L1. (This step
replaces $f(x)$ by a polynomial whose roots are reciprocals of those of $f$.) ▮

For example, starting with $f(x) = x^3 - 2$, the algorithm will output "1" (changing
$f(x)$ to $x^3 - 3x^2 - 3x - 1$); then "3" (changing $f(x)$ to $10x^3 - 6x^2 - 6x - 1$); etc.
14. [M22] (A. Hurwitz, 1891.) Show that the following rules make it possible to find
the regular continued fraction expansion of $2X$, given the partial quotients of $X$:

$$2//2a, b, c, \ldots// = //a, 2b + 2//c, \ldots// //;$$
$$2//2a+1, b, c, \ldots// = //a, 1, 1 + 2//b-1, c, \ldots// //.$$

Use this idea to find the regular continued fraction expansion of $\frac12 e$, given the expansion
of $e$ in (13).

15. [M31] (R. W. Gosper.) Generalizing exercise 14, design an algorithm that computes
the continued fraction $X_0 + //X_1, X_2, \ldots//$ for $(ax + b)/(cx + d)$, given the
continued fraction $x_0 + //x_1, x_2, \ldots//$ for $x$, and given integers $a$, $b$, $c$, $d$ with $ad \ne bc$.
Make your algorithm an "online coroutine" that outputs as many $X_k$ as possible before
inputting each $x_j$. Demonstrate how your algorithm computes $(97x + 39)/(-62x - 25)$
when $x = -1 + //5, 1, 1, 1, 2, 1, 2//$.

16. [HM30] (L. Euler, 1731.) Let $f_0(z) = (e^z - e^{-z})/(e^z + e^{-z}) = \tanh z$, and let
$f_{n+1}(z) = 1/f_n(z) - (2n+1)/z$. Prove that, for all $n$, $f_n(z)$ is an analytic function of
the complex variable $z$ in a neighborhood of the origin, and it satisfies the differential
equation $f_n'(z) = 1 - f_n(z)^2 - 2nf_n(z)/z$. Use this fact to prove that

$$\tanh z = //z^{-1}, 3z^{-1}, 5z^{-1}, 7z^{-1}, \ldots//;$$

then apply Hurwitz's rule (exercise 14) to prove that

$$e^{-1/n} = //\overline{1, (2m+1)n - 1, 1}//, \qquad m \ge 0.$$

(This notation denotes the infinite continued fraction $//1, n-1, 1, 1, 3n-1, 1, 1,
5n-1, 1, \ldots//$.) Also find the regular continued fraction expansion of $e^{-2/n}$ when
$n > 0$ is odd.

17. [M23] (a) Prove that $//x_1, -x_2// = //x_1 - 1, 1, x_2 - 1//$. (b) Generalize this
identity, obtaining a formula for $//x_1, -x_2, x_3, -x_4, x_5, -x_6, \ldots, x_{2n-1}, -x_{2n}//$ in which
all partial quotients are positive integers when the $x$'s are large positive integers.
(c) The result of exercise 16 implies that $\tan 1 = //1, -3, 5, -7, \ldots//$. Find the regular
continued fraction expansion of $\tan 1$.

18. [M25] Show that
$//a_1, a_2, \ldots, a_m, x_1, a_1, a_2, \ldots, a_m, x_2, a_1, a_2, \ldots, a_m, x_3, \ldots// -
//a_m, \ldots, a_2, a_1, x_1, a_m, \ldots, a_2, a_1, x_2, a_m, \ldots, a_2, a_1, x_3, \ldots//$
does not depend on $x_1$, $x_2$, $x_3$, .... Hint: Multiply both continued fractions by $K_m(a_1, a_2, \ldots, a_m)$.


19. [M20] Prove that $F(x) = \log_b(1 + x)$ satisfies Eq. (24).

20. [HM20] Derive (38) from (37).

21. [HM29] (E. Wirsing.) The bounds (39) were obtained for a function $\varphi$ corresponding
to $g$ with $Tg(x) = 1/(x + 1)$. Show that the function corresponding to
$Tg(x) = 1/(x + c)$ yields better bounds, when $c > 0$ is an appropriate constant.

22. [HM46] (K. I. Babenko.) Develop efficient means to calculate accurate approximations
to the quantities $\lambda_j$ and $\Psi_j(x)$ in (44), for small $j \ge 3$ and for $0 \le x \le 1$.


23. [HM23] Prove (53), using results from the proof of Theorem W. 


24. [M22] What is the average value of a partial quotient An in the regular continued 
fraction expansion of a random real number? 


25. [HM25] Find an example of a set $\mathcal{I} = I_1 \cup I_2 \cup I_3 \cup \cdots \subseteq [0..1]$, where the $I$'s
are disjoint intervals, for which (45) does not hold.

26. [M23] Show that if the numbers $\{1/n, 2/n, \ldots, \lfloor n/2\rfloor/n\}$ are expressed as regular
continued fractions, the result is symmetric between left and right, in the sense that
$//A_t, \ldots, A_2, A_1//$ appears whenever $//A_1, A_2, \ldots, A_t//$ does.


27. [M21] Derive (55) from (49) and (54). 
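28. [M23] Prove the following identities involving the three number-theoretic functions
$\varphi(n)$, $\mu(n)$, $\Lambda(n)$:

a) $\sum_{d\backslash n}\mu(d) = \delta_{n1}$.

b) $\ln n = \sum_{d\backslash n}\Lambda(d)$, $\quad n = \sum_{d\backslash n}\varphi(d)$.

c) $\Lambda(n) = \sum_{d\backslash n}\mu(n/d)\ln d$, $\quad \varphi(n) = \sum_{d\backslash n}\mu(n/d)\,d$.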


29. [M23] Assuming that Tn is given by (55), show that (57) equals (58). 


30. [HM32] The following "greedy" variant of Euclid's algorithm is often suggested:
Instead of replacing $v$ by $u \bmod v$ during the division step, replace it by $|(u \bmod v) - v|$
if $u \bmod v > \frac12 v$. Thus, for example, if $u = 26$ and $v = 7$, we have
$\gcd(26, 7) = \gcd(-2, 7) = \gcd(7, 2)$; $-2$ is the remainder of smallest magnitude when multiples of 7
are subtracted from 26. Compare this procedure with Euclid's algorithm; estimate the
number of division steps this method saves, on the average.

▶ 31. [M35] Find the worst case of the modification of Euclid's algorithm suggested in
exercise 30: What are the smallest inputs $u > v > 0$ that require $n$ division steps?
32. [20] (a) A Morse code sequence of length $n$ is a string of $r$ dots and $s$ dashes,
where $r + 2s = n$. For example, the Morse code sequences of length 4 are

    • • • •,    • • -,    • - •,    - • •,    - -.

Noting that the continuant $K_4(x_1, x_2, x_3, x_4)$ is $x_1x_2x_3x_4 + x_1x_2 + x_1x_4 + x_3x_4 + 1$,
find and prove a simple relation between $K_n(x_1, \ldots, x_n)$ and Morse code sequences of
length $n$. (b) (L. Euler, Novi Comm. Acad. Sci. Pet. 9 (1762), 53-69.) Prove that

$$K_{m+n}(x_1, \ldots, x_{m+n}) = K_m(x_1, \ldots, x_m)K_n(x_{m+1}, \ldots, x_{m+n})
+ K_{m-1}(x_1, \ldots, x_{m-1})K_{n-1}(x_{m+2}, \ldots, x_{m+n}).$$
33. [M32] Let $h(n)$ be the number of representations of $n$ in the form

$$n = xx' + yy', \qquad x > y > 0, \qquad x' > y' > 0, \qquad x \perp y, \qquad \text{integer } x, x', y, y'.$$

a) Show that if the conditions are relaxed to allow $x' = y'$, the number of representations
is $h(n) + \lfloor(n-1)/2\rfloor$.

b) Show that for fixed $y > 0$ and $0 < t \le y$, where $t \perp y$, and for each fixed $x'$
in the range $0 < x' < n/(y+t)$ such that $x't \equiv n$ (modulo $y$), there is exactly
one representation of $n$ satisfying the restrictions of (a) and the condition $x \equiv t$
(modulo $y$).

c) Consequently, $h(n) = \sum\lceil(n/(y+t) - t')/y\rceil - \lfloor(n-1)/2\rfloor$, where the sum is over
all positive integers $y$, $t$, $t'$ such that $t \perp y$, $t \le y$, $t' \le y$, $tt' \equiv n$ (modulo $y$).

d) Show that each of the $h(n)$ representations can be expressed uniquely in the form

$$x = K_m(x_1, \ldots, x_m), \qquad y = K_{m-1}(x_1, \ldots, x_{m-1}),$$
$$x' = K_k(x_{m+1}, \ldots, x_{m+k})\,d, \qquad y' = K_{k-1}(x_{m+2}, \ldots, x_{m+k})\,d,$$

where $m$, $k$, $d$, and the $x_j$ are positive integers with $x_1 \ge 2$, $x_{m+k} \ge 2$, and $d$ is a divisor
of $n$. The identity of exercise 32 now implies that $n/d = K_{m+k}(x_1, \ldots, x_{m+k})$.
Conversely, any given sequence of positive integers $x_1$, ..., $x_{m+k}$ such that $x_1 \ge 2$,
$x_{m+k} \ge 2$, and $K_{m+k}(x_1, \ldots, x_{m+k})$ divides $n$, corresponds in this way to $m + k - 1$
representations of $n$.

e) Therefore $nT_n = \lfloor(5n-3)/2\rfloor + 2h(n)$.
34. [HM40] (H. Heilbronn.) Let $h_d(n)$ be the number of representations of $n$ as in
exercise 33 such that $xd < x'$, plus half the number of representations with $xd = x'$.

a) Let $g(n)$ be the number of representations without the requirement that $x \perp y$.
Prove that

$$h(n) = \sum_{d\backslash n}\mu(d)\,g\Bigl(\frac{n}{d}\Bigr), \qquad g(n) = 2\sum_{d\backslash n}h_d\Bigl(\frac{n}{d}\Bigr).$$

b) Generalizing exercise 33(b), show that for $d \ge 1$, $h_d(n) = \sum\bigl(n/(y(y+t))\bigr) + O(n)$,
where the sum is over all integers $y$ and $t$ such that $t \perp y$ and $0 < t \le y < \sqrt{n/d}$.

c) Show that $\sum\bigl(y/(y+t)\bigr) = \varphi(y)\ln 2 + O(\sigma_{-1}(y))$, where the sum is over the range
$0 < t \le y$, $t \perp y$, and where $\sigma_{-1}(y) = \sum_{d\backslash y}(1/d)$.

d) Show that $\sum_{y=1}^{n}\varphi(y)/y^2 = \sum_{d=1}^{n}\mu(d)H_{\lfloor n/d\rfloor}/d^2$.

e) Hence we have the asymptotic formula

$$T_n = ((12\ln 2)/\pi^2)\Bigl(\ln n - \sum_{d\backslash n}\Lambda(d)/d\Bigr) + O(\sigma_{-1}(n)^2).$$
35. [HM41] (A. C. Yao and D. E. Knuth.) Prove that the sum of all partial quotients
for the fractions $m/n$, for $1 \le m < n$, is equal to $2(\sum\lfloor x/y\rfloor + \lfloor n/2\rfloor)$, where the sum is
over all representations $n = xx' + yy'$ satisfying the conditions of exercise 33(a). Show
that $\sum\lfloor x/y\rfloor = 3\pi^{-2}n(\ln n)^2 + O(n\log n\,(\log\log n)^2)$, and apply this to the "ancient"
form of Euclid's algorithm that uses only subtraction instead of division.

36. [M25] (G. H. Bradley.) What is the smallest value of $u_n$ such that the calculation
of $\gcd(u_1, \ldots, u_n)$ by Algorithm 4.5.2C requires $N$ divisions, if Euclid's algorithm is
used throughout? Assume that $N \ge n \ge 3$.


37. [M38] (T. S. Motzkin and E. G. Straus.) Let $a_1$, ..., $a_n$ be positive integers. Show
that $\max K_n(a_{p(1)}, \ldots, a_{p(n)})$, over all permutations $p(1)\ldots p(n)$ of $\{1, 2, \ldots, n\}$, occurs
when $a_{p(1)} \ge a_{p(n)} \ge a_{p(2)} \ge a_{p(n-1)} \ge \cdots$; and the minimum occurs when

$$a_{p(1)} \le a_{p(n)} \le a_{p(3)} \le a_{p(n-2)} \le a_{p(5)} \le \cdots \le a_{p(6)} \le a_{p(n-3)} \le a_{p(4)} \le a_{p(n-1)} \le a_{p(2)}.$$

38. [M25] (J. Mikusiński.) Let $L(n) = \max_{m\ge 0} T(m,n)$. Theorem F shows that
$L(n) \le \log_\phi(\sqrt5\,n + 1) - 2$; prove that $2L(n) \ge \log_\phi(\sqrt5\,n + 1) - 2$.


39. [M25] (R. W. Gosper.) If a baseball player’s batting average is .334, what is the 
smallest possible number of times he has been at bat? [Note for non-baseball-fans: 
Batting average = (number of hits)/(times at bat), rounded to three decimal places.] 


40. [M28] (The Stern–Brocot tree.) Consider an infinite binary tree in which each
node is labeled with the fraction $(p_l + p_r)/(q_l + q_r)$, where $p_l/q_l$ is the label of the
node's nearest left ancestor and $p_r/q_r$ is the label of the node's nearest right ancestor.
(A left ancestor is one that precedes a node in symmetric order, while a right ancestor
follows the node. See Section 2.3.1 for the definition of symmetric order.) If the node
has no left ancestors, $p_l/q_l = 0/1$; if it has no right ancestors, $p_r/q_r = 1/0$. Thus the
label of the root is 1/1; the labels of its two children are 1/2 and 2/1; the labels of the
four nodes on level 2 are 1/3, 2/3, 3/2, and 3/1, from left to right; the labels of the
eight nodes on level 3 are 1/4, 2/5, 3/5, 3/4, 4/3, 5/3, 5/2, 4/1; and so on.

Prove that p is relatively prime to q in each label p/q; furthermore, the node 
labeled p/q precedes the node labeled p'/q’ in symmetric order if and only if the labels 
satisfy p/q < p’/q’. Find a connection between the continued fraction for the label of 
a node and the path to that node, thereby showing that each positive rational number 
appears as the label of exactly one node in the tree. 


41. [M40] (J. Shallit, 1979.) Show that the regular continued fraction expansion of

$$\frac{1}{2^1} + \frac{1}{2^3} + \frac{1}{2^7} + \cdots = \sum_{n\ge 1}\frac{1}{2^{2^n-1}}$$

contains only 1s and 2s and has a fairly simple pattern. Prove that the partial quotients
of Liouville's numbers $\sum_{n\ge 1} l^{-n!}$ also have a regular pattern, when $l$ is any integer
$\ge 2$. [The latter numbers, introduced by J. Liouville in J. de Math. Pures et Appl. 16
(1851), 133-142, were the first explicitly defined numbers to be proved transcendental.
The former number and similar constants were first proved transcendental by A. J.
Kempner, Trans. Amer. Math. Soc. 17 (1916), 476-482.]


42. [M30] (J. Lagrange, 1798.) Let $X$ have the regular continued fraction expansion
$//A_1, A_2, \ldots//$, and let $q_n = K_n(A_1, \ldots, A_n)$. Let $\|x\|$ denote the distance from $x$ to
the nearest integer, namely $\min_p |x - p|$. Show that $\|qX\| \ge \|q_{n-1}X\|$ for $1 \le q < q_n$.
(Thus the denominators $q_n$ of the so-called convergents $p_n/q_n = //A_1, \ldots, A_n//$ are the
"record-breaking" integers that make $\|qX\|$ achieve new lows.)

43. [M30] (D. W. Matula.) Show that the "mediant rounding" rule for fixed slash
or floating slash numbers, Eq. 4.5.1-(1), can be implemented simply as follows, when
the number $x > 0$ is not representable: Let the regular continued fraction expansion
of $x$ be $a_0 + //a_1, a_2, \ldots//$, and let $p_n = K_{n+1}(a_0, \ldots, a_n)$, $q_n = K_n(a_1, \ldots, a_n)$. Then
round$(x) = (p_i/q_i)$, where $(p_i/q_i)$ is representable but $(p_{i+1}/q_{i+1})$ is not. [Hint: See
exercise 40.]

44. [M25] Suppose we are doing fixed slash arithmetic with mediant rounding, where
the fraction $(u/u')$ is representable if and only if $|u| < M$ and $0 \le u' < N$ and $u \perp u'$.
Prove or disprove the identity $((u/u') \oplus (v/v')) \ominus (v/v') = (u/u')$ for all representable
$(u/u')$ and $(v/v')$, provided that $u' < \sqrt{N}$ and no overflow occurs.


45. [M25] Show that Euclid's algorithm (Algorithm 4.5.2A) applied to two $n$-bit
binary numbers requires $O(n^2)$ units of time, as $n \to \infty$. (The same upper bound
obviously holds for Algorithm 4.5.2B.)

46. [M43] Can the upper bound $O(n^2)$ in exercise 45 be decreased, if another algorithm
for calculating the greatest common divisor is used?


47. [M40] Develop a computer program to find as many partial quotients of $x$ as
possible, when $x$ is a real number given with high precision. Use your program to
calculate the first several thousand partial quotients of Euler's constant $\gamma$, which can
be calculated as explained by D. W. Sweeney in Math. Comp. 17 (1963), 170-178. (If
$\gamma$ is a rational number, you might discover its numerator and denominator, thereby
resolving a famous problem in mathematics. According to the theory in the text, we
expect to get about 0.97 partial quotients per decimal digit, when the given number is
random. Multiprecision division is not necessary; see Algorithm 4.5.2L and the article
by J. W. Wrench, Jr. and D. Shanks, Math. Comp. 20 (1966), 444-447.)

48. [M21] Let $T_0 = (1, 0, u)$, $T_1 = (0, 1, v)$, ..., $T_{n+1} = ((-1)^{n+1}v/d, (-1)^n u/d, 0)$
be the sequence of vectors computed by Algorithm 4.5.2X (the extended Euclidean
algorithm), and let $//a_1, \ldots, a_n//$ be the regular continued fraction for $v/u$. Express $T_j$
in terms of continuants involving $a_1$, ..., $a_n$, for $1 < j \le n$.

49. [M33] By adjusting the final iteration of Algorithm 4.5.2X so that $a_n$ is optionally
replaced by two partial quotients $(a_n - 1, 1)$, we can assume that the number of
iterations, $n$, has a given parity. Continuing the previous exercise, let $\lambda$ and $\mu$ be
arbitrary positive real numbers and let $\theta = \sqrt{\lambda\mu v/d}$, where $d = \gcd(u, v)$. Prove that
if $n$ is even, and if $T_j = (x_j, y_j, z_j)$, we have $\min_{j=1}^{n+1} |\lambda x_j + \mu z_j - [j\text{ even}]\,\theta| \le \theta$.

50. [M25] Given an irrational number $\alpha \in (0..1)$ and real numbers $\beta$ and $\gamma$ with
$0 \le \beta < \gamma < 1$, let $f(\alpha, \beta, \gamma)$ be the smallest nonnegative integer $n$ such that
$\beta \le \alpha n \bmod 1 < \gamma$. (Such an integer exists because of Weyl's theorem, exercise 3.5-22.)
Design an algorithm to compute $f(\alpha, \beta, \gamma)$.

51. [M30] (Rational reconstruction.) The number 28481 turns out to be equal to
41/316 (modulo 199999), in the sense that $316 \cdot 28481 \equiv 41$. How could a person
discover this? Given integers $a$ and $m$ with $m > a > 1$, explain how to find integers $x$
and $y$ such that $ax \equiv y$ (modulo $m$), $x \perp y$, $0 < x \le \sqrt{m/2}$, and $|y| \le \sqrt{m/2}$, or to
determine that no such $x$ and $y$ exist. Can there be more than one solution?


4.5.4. Factoring into Primes 


Several of the computational methods we have encountered in this book rest on 
the fact that every positive integer $n$ can be expressed in a unique way in the
form

$$n = p_1p_2\ldots p_t, \qquad p_1 \le p_2 \le \cdots \le p_t, \tag{1}$$

where each $p_k$ is prime. (When $n = 1$, this equation holds for $t = 0$.) It is
unfortunately not a simple matter to find this prime factorization of n, or to 
determine whether or not n is prime. So far as anyone knows, it is a great 
deal harder to factor a large number n than to compute the greatest common 
divisor of two large numbers m and n; therefore we should avoid factoring large 
numbers whenever possible. But several ingenious ways to speed up the factoring 
process have been discovered, and we will now investigate some of them. [A 
comprehensive history of factoring before 1950 has been compiled by H. C. 
Williams and J. O. Shallit, Proc. Symp. Applied Math. 48 (1993), 481-531.] 


Divide and factor. First let us consider the most obvious algorithm for factor- 
ization: If n > 1, we can divide n by successive primes p = 2, 3, 5, ... until 
discovering the smallest p for which n mod p = 0. Then p is the smallest prime 
factor of $n$, and the same process may be applied to $n \leftarrow n/p$ in an attempt
to divide this new value of $n$ by $p$ and by higher primes. If at any stage we
find that $n \bmod p \ne 0$ but $\lfloor n/p\rfloor \le p$, we can conclude that $n$ is prime; for
if $n$ is not prime, then by (1) we must have $n \ge p_1^2$, but $p_1 > p$ implies that
$p_1^2 \ge (p+1)^2 > p(p+1) > p^2 + (n \bmod p) \ge \lfloor n/p\rfloor p + (n \bmod p) = n$. This leads
us to the following procedure:


Algorithm A (Factoring by division). Given a positive integer $N$, this algorithm
finds the prime factors $p_1 \le p_2 \le \cdots \le p_t$ of $N$ as in Eq. (1). The method makes
use of an auxiliary sequence of trial divisors

$$2 = d_0 < d_1 < d_2 < d_3 < \cdots, \tag{2}$$

which includes all prime numbers $\le \sqrt{N}$ (and possibly values that are not prime,
if convenient). The sequence of $d$'s must also include at least one value such that
$d_k \ge \sqrt{N}$.

A1. [Initialize.] Set $t \leftarrow 0$, $k \leftarrow 0$, $n \leftarrow N$. (During this algorithm the variables
$t$, $k$, $n$ are related by the following condition: "$n = N/p_1\ldots p_t$, and $n$ has
no prime factors less than $d_k$.")

A2. [$n = 1$?] If $n = 1$, the algorithm terminates.

A3. [Divide.] Set $q \leftarrow \lfloor n/d_k\rfloor$, $r \leftarrow n \bmod d_k$. (Here $q$ and $r$ are the quotient
and remainder obtained when $n$ is divided by $d_k$.)

A4. [Zero remainder?] If $r \ne 0$, go to step A6.

A5. [Factor found.] Increase $t$ by 1, and set $p_t \leftarrow d_k$, $n \leftarrow q$. Return to step A2.

A6. [Low quotient?] If $q > d_k$, increase $k$ by 1 and return to step A3.

A7. [$n$ is prime.] Increase $t$ by 1, set $p_t \leftarrow n$, and terminate the algorithm. ▮
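Steps A1 through A7 can be sketched in Python as follows. This is only an illustration: the names `factor` and `trial_divisors` and the returned count of divisions are choices made here, not part of the algorithm's statement. The divisors 2, 3, 5, 7, 11, 13, ... (all numbers not divisible by 2 or 3, after the first two) are one admissible choice for sequence (2); because the generator never ends, it always supplies a value at least as large as the square root of N.

```python
from typing import Iterator, List, Tuple

def trial_divisors() -> Iterator[int]:
    """Yield 2, 3, 5, 7, 11, 13, 17, 19, 23, 25, ...: after 5, alternately add 2 and 4."""
    yield 2
    yield 3
    d, step = 5, 2
    while True:
        yield d
        d += step
        step = 6 - step

def factor(N: int) -> Tuple[List[int], int]:
    """Return (prime factors of N in nondecreasing order, number of divisions)."""
    if N < 1:
        raise ValueError("N must be a positive integer")
    primes: List[int] = []
    divisions = 0
    divisors = trial_divisors()
    n, d = N, next(divisors)         # A1: initialize
    while n != 1:                    # A2: n = 1?
        q, r = divmod(n, d)          # A3: divide
        divisions += 1
        if r == 0:                   # A4/A5: factor found
            primes.append(d)
            n = q
        elif q > d:                  # A6: quotient still large, try next divisor
            d = next(divisors)
        else:                        # A7: n is prime
            primes.append(n)
            break
    return primes, divisions

print(factor(25852))
```

Counting the divisions makes it possible to compare inputs whose largest prime factors differ greatly in size.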


As an example of Algorithm A, consider the factorization of the number
$N = 25852$. We find immediately that $N = 2 \cdot 12926$; hence $p_1 = 2$. Furthermore,
$12926 = 2 \cdot 6463$, so $p_2 = 2$. But now $n = 6463$ is not divisible by 2, 3, 5, ..., 19;
we find that $n = 23 \cdot 281$, hence $p_3 = 23$. Finally $281 = 12 \cdot 23 + 5$ and $12 \le 23$;
hence $p_4 = 281$. The determination of 25852's factors has therefore involved a
total of 12 division operations; on the other hand, if we had tried to factor the
slightly smaller number 25849 (which is prime), at least 38 division operations
would have been performed. This illustrates the fact that Algorithm A requires
a running time roughly proportional to $\max(p_{t-1}, \sqrt{p_t})$. (If $t = 1$, this formula
is valid if we adopt the convention $p_0 = 1$.)

[Fig. 11. A simple factoring algorithm.]

The sequence $d_0$, $d_1$, $d_2$, ... of trial divisors used in Algorithm A can be
taken to be simply 2, 3, 5, 7, 11, 13, 17, 19, 23, 25, 29, 31, 35, ..., where we
alternately add 2 and 4 after the first three terms. This sequence contains all
numbers that are not multiples of 2 or 3; it also includes numbers such as 25,
35, 49, etc., which are not prime, but the algorithm will still give the correct
answer. A further savings of 20 percent in computation time can be made by
removing the numbers $30m \pm 5$ from the list for $m \ge 1$, thereby eliminating all
of the spurious multiples of 5. The exclusion of multiples of 7 shortens the list
by 14 percent more, etc. A compact bit table can be used to govern the choice
of trial divisors.
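The 20 percent and 14 percent figures follow from counting the residues that survive when multiples of small primes are excluded. The Python sketch below illustrates that count; `surviving_fraction` is a hypothetical helper name, not something defined in the text.

```python
from math import gcd

def surviving_fraction(primes) -> float:
    """Fraction of integers that are not multiples of any of the given primes."""
    modulus = 1
    for p in primes:
        modulus *= p
    survivors = sum(1 for r in range(modulus) if gcd(r, modulus) == 1)
    return survivors / modulus

f_23 = surviving_fraction([2, 3])          # 2 residues out of 6
f_235 = surviving_fraction([2, 3, 5])      # 8 residues out of 30
f_2357 = surviving_fraction([2, 3, 5, 7])  # 48 residues out of 210

saving_from_5 = 1 - f_235 / f_23           # relative shortening of the list
saving_from_7 = 1 - f_2357 / f_235
print(saving_from_5, saving_from_7)
```

Each additional prime p removes a further 1/p of the remaining candidates, which is why the gains shrink quickly while the table of residues grows.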

If $N$ is known to be small, it is reasonable to have a table of all the necessary
primes as part of the program. For example, if $N$ is less than a million, we
need only include the 168 primes less than a thousand (followed by the value
$d_{168} = 1000$, to terminate the list in case $N$ is a prime larger than $997^2$). Such
a table can be set up by means of a short auxiliary program; see, for example,
Algorithm 1.3.2P or exercise 8.
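One possible auxiliary program is sketched below in Python. It uses the sieve of Eratosthenes, which is a substitute chosen for brevity and is not claimed to be the method of Algorithm 1.3.2P or exercise 8; the names `primes_below` and `trial_divisor_table` are hypothetical.

```python
from math import isqrt

def primes_below(limit: int) -> list:
    """Sieve of Eratosthenes: all primes p < limit."""
    if limit < 3:
        return []
    is_prime = bytearray([1]) * limit
    is_prime[0] = is_prime[1] = 0
    for i in range(2, isqrt(limit - 1) + 1):
        if is_prime[i]:
            # Cross out i*i, i*i + i, ... below the limit.
            is_prime[i * i::i] = bytearray(len(range(i * i, limit, i)))
    return [i for i, flag in enumerate(is_prime) if flag]

# d_0, ..., d_167 are the primes below 1000; d_168 = 1000 ends the table.
trial_divisor_table = primes_below(1000) + [1000]
print(len(trial_divisor_table), trial_divisor_table[-3:])
```

The final entry 1000 is not prime; it only guarantees that the table contains a value at least as large as the square root of any N below a million.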

How many trial divisions are necessary in Algorithm A? Let $\pi(x)$ be the
number of primes $\le x$, so that $\pi(2) = 1$, $\pi(10) = 4$; the asymptotic behavior
of this function has been studied extensively by many of the world's greatest
mathematicians, beginning with Legendre in 1798. Numerous advances made
during the nineteenth century culminated in 1899, when Charles de La Vallée
Poussin proved that, for some $A > 0$,

$$\pi(x) = \int_2^x \frac{dt}{\ln t} + O\bigl(xe^{-A\sqrt{\log x}}\bigr). \tag{3}$$

[Mém. Couronnés Acad. Roy. Belgique 59 (1899), 1-74; see also J. Hadamard,
Bull. Soc. Math. de France 24 (1896), 199-220.] Integrating by parts yields

$$\pi(x) = \frac{x}{\ln x} + \frac{x}{(\ln x)^2} + \frac{2!\,x}{(\ln x)^3} + \cdots + \frac{r!\,x}{(\ln x)^{r+1}} + O\Bigl(\frac{x}{(\log x)^{r+2}}\Bigr) \tag{4}$$

for all fixed $r \ge 0$. The error term in (3) has subsequently been improved;
for example, it can be replaced by $O\bigl(x\exp(-A(\log x)^{3/5}/(\log\log x)^{1/5})\bigr)$. [See
A. Walfisz, Weyl'sche Exponentialsummen in der neueren Zahlentheorie (Berlin:
1963), Chapter 5.] Bernhard Riemann conjectured in 1859 that

$$\pi(x) \approx \sum_{k\ge 1}\mu(k)L(\sqrt[k]{x})/k = L(x) - \tfrac12 L(\sqrt{x}) - \tfrac13 L(\sqrt[3]{x}) + \cdots, \tag{5}$$

where $L(x) = \int_2^x dt/\ln t$, and his formula agrees well with actual counts when $x$
is of reasonable size:

    x         π(x)                   L(x)                    Riemann's formula
    10^3      168                    176.6                   168.3
    10^6      78498                  78626.5                 78527.4
    10^9      50847534               50849233.9              50847455.4
    10^12     37607912018            37607950279.8           37607910542.2
    10^15     29844570422669         29844571475286.5        29844570495886.9
    10^18     24739954287740860      24739954309690414.0     24739954284239494.4

(See exercise 41.) However, the distribution of large primes is not that simple,
and Riemann's conjecture (5) was disproved by J. E. Littlewood in 1914; see
Hardy and Littlewood, Acta Math. 41 (1918), 119-196, where it is shown that
there is a positive constant $C$ such that

$$\pi(x) > L(x) + C\sqrt{x}\,\log\log\log x/\log x$$

for infinitely many $x$. Littlewood's result shows that prime numbers are inherently
somewhat mysterious, and it will be necessary to develop deep properties
of mathematics before their distribution is really understood. Riemann made
another much more plausible conjecture, the famous "Riemann hypothesis,"
which states that the complex function $\zeta(z)$ is zero only when the real part of $z$ is
equal to 1/2, except in the trivial cases where $z$ is a negative even integer. This
hypothesis, if true, would imply that $\pi(x) = L(x) + O(\sqrt{x}\log x)$; see exercise 25.
Richard Brent has used a method of D. H. Lehmer to verify Riemann's hypothesis
computationally for all "small" values of $z$, by showing that $\zeta(z)$ has exactly
75,000,000 zeros whose imaginary part is in the range $0 < \Im z < 32585736.4$; all
of these zeros have $\Re z = \frac12$ and $\zeta'(z) \ne 0$. [Math. Comp. 33 (1979), 1361-1372.]

In order to analyze the average behavior of Algorithm A, we would like to 
know how large the largest prime factor $p_t$ will tend to be. This question was first
investigated by Karl Dickman [Arkiv för Mat., Astron. och Fys. 22A, 10 (1930),
1-14], who studied the probability that a random integer between 1 and $x$ will
have its largest prime factor $\le x^\alpha$. Dickman gave a heuristic argument to show
that this probability approaches the limiting value $F(\alpha)$ as $x \to \infty$, where $F$ can
be calculated from the functional equation

$$F(\alpha) = \int_0^\alpha F\!\left(\frac{t}{1-t}\right)\frac{dt}{t}, \quad\text{for } 0 \le \alpha \le 1; \qquad F(\alpha) = 1, \quad\text{for } \alpha \ge 1. \tag{6}$$

His argument was essentially this: Given $0 < t < 1$, the number of integers
less than $x$ whose largest prime factor is between $x^t$ and $x^{t+dt}$ is $xF'(t)\,dt$. The
number of primes $p$ in that range is $\pi(x^{t+dt}) - \pi(x^t) = \pi(x^t + (\ln x)x^t\,dt) - \pi(x^t) = x^t\,dt/t$.
For every such $p$, the number of integers $n$ such that “$np \le x$ and
the largest prime factor of $n$ is $\le p$” is the number of $n \le x^{1-t}$ whose largest
prime factor is $\le (x^{1-t})^{t/(1-t)}$, namely $x^{1-t}F(t/(1-t))$. Hence $xF'(t)\,dt = (x^t\,dt/t)(x^{1-t}F(t/(1-t)))$,
and (6) follows by integration. This heuristic argument
can be made rigorous; V. Ramaswami [Bull. Amer. Math. Soc. 55 (1949),
1122-1127] showed that the probability in question for fixed $\alpha$ is asymptotically
$F(\alpha) + O(1/\log x)$, as $x \to \infty$, and many other authors have extended the analysis
[see the survey by Karl K. Norton, Memoirs Amer. Math. Soc. 106 (1971), 9-27].
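If $\frac12 \le \alpha \le 1$, formula (6) simplifies to

$$F(\alpha) = 1 - \int_\alpha^1 F\!\left(\frac{t}{1-t}\right)\frac{dt}{t} = 1 - \int_\alpha^1 \frac{dt}{t} = 1 + \ln\alpha.$$

Thus, for example, the probability that a random positive integer $\le x$ has a
prime factor $> \sqrt{x}$ is $1 - F(\frac12) = \ln 2$, about 69 percent. In all such cases,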
Algorithm A must work hard. 
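
As a small numerical illustration, the following Python sketch evaluates the closed form $F(\alpha) = 1 + \ln\alpha$, which holds only for $\frac12 \le \alpha \le 1$. The function name `dickman_F_upper` is hypothetical; the sketch does not attempt to solve the functional equation (6) for smaller $\alpha$.

```python
import math

def dickman_F_upper(alpha: float) -> float:
    """F(alpha) via the closed form 1 + ln(alpha), valid only for 1/2 <= alpha <= 1."""
    if not 0.5 <= alpha <= 1.0:
        raise ValueError("closed form only applies for 1/2 <= alpha <= 1")
    return 1.0 + math.log(alpha)

# Limiting probability that the largest prime factor exceeds sqrt(x).
p_big = 1.0 - dickman_F_upper(0.5)

# The value of alpha at which F(alpha) = 1/2, namely exp(-1/2).
median_alpha = math.exp(-0.5)
```

Here `p_big` equals $\ln 2$, and `median_alpha` is about .6065: under this model, half of all integers $\le x$ have their largest prime factor below roughly $x^{.6065}$.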

The net result of this discussion is that Algorithm A will give the answer 
rather quickly if we want to factor a six-digit number; but for large N the amount 
of computer time for factorization by trial division will rapidly exceed practical 
limits, unless we are unusually lucky. 

Later in this section we will see that there are fairly good ways to determine 
whether or not a reasonably large number n is prime, without trying all divisors 
up to $\sqrt{n}$. Therefore Algorithm A would often run faster if we inserted a
primality test between steps A2 and A3; the running time for this improved
algorithm would then be roughly proportional to $p_{t-1}$, the second-largest prime
factor of $N$, instead of to $\max(p_{t-1}, \sqrt{p_t})$. By an argument analogous to Dickman’s
(see exercise 18), we can show that the second-largest prime factor of a
random integer $\le x$ will be $\le x^\beta$ with approximate probability $G(\beta)$, where

$$G(\beta) = \int_0^\beta \left(G\!\left(\frac{t}{1-t}\right) - F\!\left(\frac{t}{1-t}\right)\right)\frac{dt}{t}, \quad\text{for } 0 \le \beta \le \tfrac12. \tag{7}$$

Clearly $G(\beta) = 1$ for $\beta \ge \frac12$. (See Fig. 12.) Numerical evaluation of (6) and (7)
yields the following “percentage points”:

    F(α), G(β) =   .01    .05    .10    .20    .35    .50    .65    .80    .90    .95    .99
    α ≈          .2697  .3348  .3785  .4430  .5220  .6065  .7047  .8187  .9048  .9512  .9900
    β ≈          .0056  .0273  .0531  .1003  .1611  .2117  .2582  .3104  .3590  .3967  .4517

Thus, the second-largest prime factor will be $\le x^{.2117}$ about half the time, etc.

Fig. 12. Probability distribution functions for the
two largest prime factors of a random integer $\le x$.

The total number of prime factors, $t$, has also been intensively analyzed.
Obviously $1 \le t \le \lg N$, but these lower and upper bounds are seldom achieved.
It is possible to prove that if $N$ is chosen at random between 1 and $x$, the
probability that $t \le \ln\ln x + c\sqrt{\ln\ln x}$ approaches

$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{c} e^{-u^2/2}\,du \tag{8}$$

as $x \to \infty$, for any fixed $c$. In other words, the distribution of $t$ is essentially
normal, with mean and variance $\ln\ln x$; about 99.73 percent of all the large
integers $\le x$ have $|t - \ln\ln x| \le 3\sqrt{\ln\ln x}$. Furthermore the average value of
$t - \ln\ln x$ for $1 \le N \le x$ is known to approach

$$\gamma + \sum_{p\ \text{prime}} \bigl(\ln(1 - 1/p) + 1/(p-1)\bigr) = \gamma + \sum_{n \ge 2} \frac{\varphi(n)\ln\zeta(n)}{n}$$
$$= 1.03465\ 38818\ 97437\ 91161\ 97942\ 98464\ 63825\ 46703+. \tag{9}$$

[See G. H. Hardy and E. M. Wright, An Introduction to the Theory of Numbers,
5th edition (Oxford, 1979), §22.11; see also P. Erdős and M. Kac, Amer. J. Math.
62 (1940), 738-742.]

The size of prime factors has a remarkable connection with permutations: 
The average number of bits in the kth largest prime factor of a random n-bit 
integer is asymptotically the same as the average length of the kth largest cycle 
of a random n-element permutation, as $n \to \infty$. [See D. E. Knuth, Selected
Papers on Analysis of Algorithms (2000), 329-330, 336-337, for references to 
the relevant literature.] It follows that Algorithm A usually finds a few small 
factors and then begins a long-drawn-out search for the big ones that are left. 

An excellent exposition of the probability distribution of the prime factors 
of a random integer has been given by Patrick Billingsley, AMM 80 (1973), 
1099-1115; see also his paper in Annals of Probability 2 (1974), 749-791. 


Factoring by pseudorandom cycles. Near the beginning of Chapter 3, we 
observed that “a random number generator chosen at random isn’t very random.” 
This principle, which worked against us in that chapter, has the redeeming virtue
that it leads to a surprisingly efficient method of factorization, discovered by
J. M. Pollard [BIT 15 (1975), 331-334]. The number of computational steps
in Pollard’s method is on the order of $\sqrt{p_{t-1}}$, so it is significantly faster than
Algorithm A when $N$ is large. According to (7) and Fig. 12, the running time
will usually be well under $N^{1/4}$.

Let f(x) be any polynomial with integer coefficients, and consider the two 
sequences defined by 


$$x_0 = y_0 = A; \qquad x_{m+1} = f(x_m) \bmod N, \qquad y_{m+1} = f(y_m) \bmod p, \tag{10}$$

where $p$ is any prime factor of $N$. It follows that

$$y_m = x_m \bmod p, \qquad\text{for } m \ge 1. \tag{11}$$

Now exercise 3.1-7 shows that we will have $y_m = y_{\ell(m)-1}$ for some $m \ge 1$,
where $\ell(m)$ is the greatest power of 2 that is $\le m$. Thus $x_m - x_{\ell(m)-1}$ will
be a multiple of $p$. Furthermore if $f(y) \bmod p$ behaves as a random mapping
from the set $\{0, 1, \ldots, p-1\}$ into itself, exercise 3.1-12 shows that the average
value of the least such $m$ will be of order $\sqrt{p}$. In fact, exercise 4 below shows
that this average value for random mappings is less than $1.625\,Q(p)$, where the
function $Q(p) \approx \sqrt{\pi p/2}$ was defined in Section 1.2.11.3. If the different prime
divisors of $N$ correspond to different values of $m$ (as they almost surely will, when
$N$ is large), we will be able to find them by calculating $\gcd(x_m - x_{\ell(m)-1}, N)$
for $m = 1, 2, 3, \ldots$, until the unfactored residue is prime. Pollard called his
technique the “rho method,” because an eventually periodic sequence such as
$y_0, y_1, \ldots$ is reminiscent of the Greek letter $\rho$.


From the theory in Chapter 3, we know that a linear polynomial $f(x) = ax + c$
will not be sufficiently random for our purposes. The next-simplest
case is quadratic, say $f(x) = x^2 + 1$. We don’t know that this function is
sufficiently random, but our lack of knowledge tends to support the hypothesis
of randomness, and empirical tests show that this $f$ does work essentially as
predicted. In fact, $f$ is probably slightly better than random, since $x^2 + 1$ takes
on only $\frac12(p + 1)$ distinct values mod $p$; see Arney and Bender, Pacific J. Math.
103 (1982), 269-294. Therefore the following procedure is reasonable: 


Algorithm B (Factoring by the rho method). This algorithm outputs the prime
factors of a given integer $N \ge 2$, with high probability, although there is a chance
that it will fail.

B1. [Initialize.] Set $x \leftarrow 5$, $x' \leftarrow 2$, $k \leftarrow 1$, $l \leftarrow 1$, $n \leftarrow N$. (During this
algorithm, $n$ is the unfactored part of $N$, and the variables $x$ and $x'$ represent
the quantities $x_m \bmod n$ and $x_{\ell(m)-1} \bmod n$ in (10), where $f(x) = x^2 + 1$,
$A = 2$, $l = \ell(m)$, and $k = 2l - m$.)

B2. [Test primality.] If $n$ is prime (see the discussion below), output $n$; the
algorithm terminates.

B3. [Factor found?] Set $g \leftarrow \gcd(x' - x, n)$. If $g = 1$, go on to step B4; otherwise
output $g$. Now if $g = n$, the algorithm terminates (and it has failed, because
we know that $n$ isn’t prime). Otherwise set $n \leftarrow n/g$, $x \leftarrow x \bmod n$, $x' \leftarrow x' \bmod n$,
and return to step B2. (Note that $g$ may not be prime; this should
be tested. In the rare event that $g$ isn’t prime, its prime factors won’t be
determinable with this algorithm.)

B4. [Advance.] Set $k \leftarrow k - 1$. If $k = 0$, set $x' \leftarrow x$, $l \leftarrow 2l$, $k \leftarrow l$. Set
$x \leftarrow (x^2 + 1) \bmod n$ and return to B3. ∎


As an example of Algorithm B, let’s try to factor N = 25852 again. The 
third execution of step B3 will output g = 4 (which isn’t prime). After six 
more iterations the algorithm finds the factor g = 23. Algorithm B has not 
distinguished itself in this example, but of course it was designed to factor big 
numbers. Algorithm A takes much longer to find large prime factors, but it can’t 
be beat when it comes to removing the small ones. In practice, we should run 
Algorithm A awhile before switching over to Algorithm B. 
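
The steps of Algorithm B can be sketched in Python as follows. This is an illustrative transcription, not a tuned implementation: the names `rho_outputs` and `is_prime` are hypothetical, and step B2 is filled in with naive trial division purely for convenience (a real program would use one of the primality tests discussed later in this section). The variable `xp` stands for $x'$.

```python
from math import gcd

def is_prime(n: int) -> bool:
    """Naive trial division; a stand-in for the primality test of step B2."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def rho_outputs(N: int) -> list[int]:
    """Return the values output by steps B1-B4, in order.

    The outputs need not all be prime; if the method fails (g = n),
    the last output is the composite residue n.
    """
    outputs = []
    x, xp, k, l, n = 5, 2, 1, 1, N           # B1
    while True:
        if is_prime(n):                      # B2
            outputs.append(n)
            return outputs
        while True:
            g = gcd(xp - x, n)               # B3
            if g != 1:
                break
            k -= 1                           # B4
            if k == 0:
                xp, l = x, 2 * l
                k = l
            x = (x * x + 1) % n
        outputs.append(g)
        if g == n:                           # failure: n is composite but unsplit
            return outputs
        n //= g
        x %= n
        xp %= n

example = rho_outputs(25852)
```

For $N = 25852$ the outputs are 4, 23, and 281, in agreement with the example just described: the nonprime 4 appears first, and 23 follows six iterations later.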

We can get a better idea of Algorithm B’s prowess by considering the ten 
largest six-digit primes. The number of iterations, m(p), that Algorithm B needs 
to find the factor p is given in the following table: 


    p    = 999863 999883 999907 999917 999931 999953 999959 999961 999979 999983
    m(p) =    276    409   2106   1561   1593   1091    474   1819    395    814

Experiments by Tomás Oliveira e Silva indicate that $m(p)$ has an average value
of about $2\sqrt{p}$, and it never exceeds $16\sqrt{p}$ when $p < 1000000000$. The maximum
$m(p)$ for $p < 10^9$ is $m(850112303) = 416784$; and the maximum of $m(p)/\sqrt{p}$
occurs when $p = 695361131$, $m(p) = 406244$. According to these experimental
results, almost all 18-digit numbers can be factored in fewer than 64,000 iterations
of Algorithm B (compared to roughly 50,000,000 divisions in Algorithm A).

The time-consuming operations in each iteration of Algorithm B are the 
multiple-precision multiplication and division in step B4, and the gcd in step B3. 
The technique of “Montgomery multiplication” (exercise 4.3.1-41) will speed 
this up. Moreover, if the gcd operation is slow, Pollard suggests gaining speed 
by accumulating the product mod n of, say, ten consecutive $(x' - x)$ values
before taking each gcd; this replaces 90 percent of the gcd operations by a single 
multiplication mod N while only slightly increasing the chance of failure. He 
also suggests starting with m = q instead of m = 1 in step B1, where q is, say, 
one tenth of the number of iterations you are planning to use. 

In those rare cases where failure occurs for large N, we could try using 
$f(x) = x^2 + c$ for some $c \ne 0$ or 1. The value $c = -2$ should also be avoided,
since the recurrence $x_{m+1} = x_m^2 - 2$ has solutions of the form $x_m = r^{2^m} + r^{-2^m}$.
Other values of c do not seem to lead to simple relationships mod p, and they 
should all be satisfactory when used with suitable starting values. 
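
The closed form for $c = -2$ can be checked directly. Substituting $x_m = r^{2^m} + r^{-2^m}$ into the recurrence gives

$$x_m^2 - 2 = \bigl(r^{2^m} + r^{-2^m}\bigr)^2 - 2 = r^{2^{m+1}} + 2 + r^{-2^{m+1}} - 2 = r^{2^{m+1}} + r^{-2^{m+1}} = x_{m+1},$$

so the sequence has a simple algebraic structure and cannot be expected to behave like a random mapping.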

Richard Brent used a modification of Algorithm B to discover the prime 
factor 1238926361552897 of $2^{256} + 1$. [See Math. Comp. 36 (1981), 627-630; 38
(1982), 253-255.] 


Fermat’s method. Another approach to the factoring problem, which was used 
by Pierre de Fermat in 1643, is more suited to finding large factors than small
ones. [Fermat’s original description of his method, translated into English, can
be found in L. E. Dickson’s monumental History of the Theory of Numbers 1 
(Carnegie Inst. of Washington, 1919), 357. An equivalent idea had in fact been 
used already by Narayana Pandita in his remarkable book Ganita Kaumudi 
(1356); see Parmanand Singh, Ganita Bharati 22 (2000), 72—74.] 

Assume that $N = uv$, where $u \le v$. For practical purposes we may assume
that $N$ is odd; this means that $u$ and $v$ are odd, and we can let

$$x = (u + v)/2, \qquad y = (v - u)/2, \tag{12}$$
$$N = x^2 - y^2, \qquad 0 \le y < x \le N. \tag{13}$$

Fermat’s method consists of searching systematically for values of $x$ and $y$ that
satisfy Eq. (13). The following algorithm shows how factoring can therefore be
done without using any multiplication or division:


Algorithm C (Factoring by addition and subtraction). Given an odd number $N$,
this algorithm determines the largest factor of $N$ less than or equal to $\sqrt{N}$.

C1. [Initialize.] Set $a \leftarrow 2\lfloor\sqrt{N}\rfloor + 1$, $b \leftarrow 1$, $r \leftarrow \lfloor\sqrt{N}\rfloor^2 - N$. (During this
algorithm $a$, $b$, and $r$ correspond respectively to $2x+1$, $2y+1$, and $x^2 - y^2 - N$
as we search for a solution to (13); we will have $|r| < a$ and $b < a$.)

C2. [Done?] If $r = 0$, the algorithm terminates; we have
$$N = ((a - b)/2)((a + b - 2)/2),$$
and $(a - b)/2$ is the largest factor of $N$ less than or equal to $\sqrt{N}$.

C3. [Increase a.] Set $r \leftarrow r + a$ and $a \leftarrow a + 2$.

C4. [Increase b.] Set $r \leftarrow r - b$ and $b \leftarrow b + 2$.

C5. [Test r.] Return to step C4 if $r > 0$, otherwise go back to C2. ∎


The reader may find it amusing to find the factors of 377 by hand, using this
algorithm. The number of steps needed to find the factors $u$ and $v$ of $N = uv$ is
essentially proportional to $(a + b - 2)/2 - \lfloor\sqrt{N}\rfloor = v - \lfloor\sqrt{N}\rfloor$; this can, of course,
be a very large number, although each step can be done very rapidly on most
computers. An improvement that requires only $O(N^{1/3})$ operations in the worst
case has been developed by R. S. Lehman [Math. Comp. 28 (1974), 637-646]. 
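
A direct Python transcription of steps C1 through C5 may help in following the hand calculation for 377. The function name `fermat_largest_factor` is hypothetical; `math.isqrt` is used once, in the initialization, and the main loop uses only addition and subtraction.

```python
from math import isqrt

def fermat_largest_factor(N: int) -> int:
    """Largest factor of odd N that is <= sqrt(N), following steps C1-C5."""
    if N < 1 or N % 2 == 0:
        raise ValueError("N must be a positive odd integer")
    s = isqrt(N)
    a, b, r = 2 * s + 1, 1, s * s - N    # C1: a = 2x+1, b = 2y+1, r = x^2 - y^2 - N
    while r != 0:                        # C2
        r += a                           # C3: x increases by 1
        a += 2
        while True:
            r -= b                       # C4: y increases by 1
            b += 2
            if r <= 0:                   # C5
                break
    return (a - b) // 2

u = fermat_largest_factor(377)
```

For $N = 377$ the loop stops with $a = 43$ and $b = 17$, so $u = 13$ and $v = (a + b - 2)/2 = 29$.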

It is not quite correct to call Algorithm C “Fermat’s method,” since Fermat 
used a somewhat more streamlined approach. Algorithm C’s main loop is quite 
fast on computers, but it is not very suitable for hand calculation. Fermat didn’t 
actually maintain the running value of $y$; he would look at $x^2 - N$ and guess
whether or not this quantity was a perfect square by looking at its least significant
digits. (The last two digits of a perfect square must be 00, e1, e4, 25, o6, or
e9, where e is an even digit and o is an odd digit.) Therefore he avoided the
operations of steps C4 and C5, replacing them by an occasional determination 
that a certain number is not a perfect square. 
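
The rule about the last two digits is easy to confirm by brute force. The following Python sketch compares the set of squares mod 100 with the pattern 00, e1, e4, 25, o6, e9; the helper name `matches_pattern` is hypothetical.

```python
# All possible values of the last two digits of a perfect square.
squares_mod_100 = {(x * x) % 100 for x in range(100)}

def matches_pattern(d: int) -> bool:
    """True if the two-digit value d has the form 00, e1, e4, 25, o6, or e9."""
    tens, units = divmod(d, 10)
    if d in (0, 25):
        return True
    if units in (1, 4, 9):
        return tens % 2 == 0      # even tens digit
    if units == 6:
        return tens % 2 == 1      # odd tens digit
    return False

pattern = {d for d in range(100) if matches_pattern(d)}
```

Both sets contain the same 22 residues, so 78 percent of all two-digit endings rule out a perfect square immediately.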

Fermat’s method of looking at the rightmost digits can, of course, be generalized
by using other moduli. Suppose for clarity that $N = 8616460799$, a number
whose historic significance is explained below, and consider the following table:

     m   if x mod m is            then x^2 mod m is        and (x^2 - N) mod m is
     3   0,1,2                    0,1,1                    1,2,2
     5   0,1,2,3,4                0,1,4,4,1                1,2,0,0,2
     7   0,1,2,3,4,5,6            0,1,4,2,2,4,1            5,6,2,0,0,2,6
     8   0,1,2,3,4,5,6,7          0,1,4,1,0,1,4,1          1,2,5,2,1,2,5,2
    11   0,1,2,3,4,5,6,7,8,9,10   0,1,4,9,5,3,3,5,9,4,1    10,0,3,8,4,2,2,4,8,3,0

If $x^2 - N$ is to be a perfect square $y^2$, it must have a residue mod $m$ consistent
with this fact, for all $m$. For example, if $N = 8616460799$ and $x \bmod 3 \ne 0$, then
$(x^2 - N) \bmod 3 = 2$, so $x^2 - N$ cannot be a perfect square; therefore $x$ must be
a multiple of 3 whenever $N = x^2 - y^2$. The table tells us, in fact, that

    x mod 3  = 0;
    x mod 5  = 0, 2, or 3;
    x mod 7  = 2, 3, 4, or 5;                                  (14)
    x mod 8  = 0 or 4 (hence x mod 4 = 0);
    x mod 11 = 1, 2, 4, 7, 9, or 10.

This narrows down the search for $x$ considerably. For example, $x$ must be a
multiple of 12. We must have $x \ge \lceil\sqrt{N}\,\rceil = 92825$, and the least such multiple
of 12 is 92832. This value has residues (2, 5, 3) modulo (5, 7, 11) respectively, so
it fails (14) with respect to modulus 11. Increasing $x$ by 12 changes the residue
mod 5 by 2, mod 7 by 5, and mod 11 by 1; so it is easy to see that the first
value of $x \ge 92825$ that satisfies all of the conditions in (14) is $x = 92880$. Now
$92880^2 - N = 10233601$, and the pencil-and-paper method for square root tells
us that $10233601 = 3199^2$ is indeed a perfect square. Therefore we have found
the desired solution $x = 92880$, $y = 3199$, and the factorization is

$$8616460799 = (x - y)(x + y) = 89681 \cdot 96079.$$
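
The residue conditions and the search for $x$ can be reproduced with a few lines of Python. This is only a sketch of the hand calculation; the names `allowed_residues` and `allowed` are hypothetical.

```python
from math import isqrt

N = 8616460799
moduli = (3, 5, 7, 8, 11)

def allowed_residues(N: int, m: int) -> list[int]:
    """Residues x mod m for which x^2 - N is a square modulo m."""
    squares = {(y * y) % m for y in range(m)}
    return [x for x in range(m) if (x * x - N) % m in squares]

allowed = {m: allowed_residues(N, m) for m in moduli}

# Start at ceil(sqrt(N)) and advance until every residue condition holds.
x = isqrt(N - 1) + 1
while any(x % m not in allowed[m] for m in moduli):
    x += 1

y = isqrt(x * x - N)
is_square = (y * y == x * x - N)
```

The dictionary `allowed` agrees with the conditions in (14), the loop stops at $x = 92880$, and the single square-root test then succeeds with $y = 3199$.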


This value of N is interesting because the English economist and logician W. 
S. Jevons introduced it as follows in a well-known book: “Given any two num- 
bers, we may by a simple and infallible process obtain their product, but it is 
quite another matter when a large number is given to determine its factors. Can 
the reader say what two numbers multiplied together will produce the number 
8,616,460,799? I think it unlikely that anyone but myself will ever know.” [The 
Principles of Science (1874), Chapter 7.] We have just seen, however, that Fer- 
mat could have factored N in less than 10 minutes, on the back of an envelope! 
Jevons’s point about the difficulty of factoring versus multiplying is well taken, 
but only if we form the product of numbers that aren’t so close to each other. 

In place of the moduli considered in (14), we can use any powers of distinct 
primes. For example, if we had used 25 in place of 5, we would find that the 
only permissible values of $x \bmod 25$ are 0, 5, 7, 10, 15, 18, and 20. This gives
more information than (14). In general, we will get more information modulo $p^2$
than we do modulo $p$, for odd primes $p$, whenever $x^2 - N \equiv 0$ (modulo $p$) has a
solution $x$. Individual primes $p$ and $q$ are, however, preferable to moduli like $p^2$
unless $p$ is quite small, because we tend to get even more information mod $pq$.

The modular method just used is called a sieve procedure, since we can
imagine passing all integers through a “sieve” for which only those values with
$x \bmod 3 = 0$ come out, then sifting these numbers through another sieve that
allows only numbers with $x \bmod 5 = 0$, 2, or 3 to pass, etc. Each sieve by itself
will remove about half of the remaining values (see exercise 6); and when we sieve
with respect to moduli that are relatively prime in pairs, each sieve is independent
of the others because of the Chinese remainder theorem (Theorem 4.3.2C). So if
we sieve with respect to, say, 30 different primes, only about one value in every
$2^{30}$ will need to be examined to see if $x^2 - N$ is a perfect square $y^2$.


Algorithm D (Factoring with sieves). Given an odd number $N$, this algorithm
determines the largest factor of $N$ less than or equal to $\sqrt{N}$. The procedure
uses moduli $m_1, m_2, \ldots, m_r$ that are relatively prime to each other in pairs and
relatively prime to $N$. We assume that we have access to $r$ sieve tables $S[i, j]$ for
$0 \le j < m_i$, $1 \le i \le r$, where

$$S[i, j] = [\,j^2 - N \equiv y^2 \ (\text{modulo } m_i) \text{ has a solution } y\,].$$

D1. [Initialize.] Set $x \leftarrow \lceil\sqrt{N}\,\rceil$, and set $k_i \leftarrow (-x) \bmod m_i$ for $1 \le i \le r$.
(Throughout this algorithm the index variables $k_1, k_2, \ldots, k_r$ will be set so
that $k_i = (-x) \bmod m_i$.)

D2. [Sieve.] If $S[i, k_i] = 1$ for $1 \le i \le r$, go to step D4.

D3. [Step x.] Set $x \leftarrow x + 1$, and set $k_i \leftarrow (k_i - 1) \bmod m_i$ for $1 \le i \le r$. Return
to step D2.

D4. [Test $x^2 - N$.] Set $y \leftarrow \lfloor\sqrt{x^2 - N}\rfloor$ or to $\lceil\sqrt{x^2 - N}\,\rceil$. If $y^2 = x^2 - N$,
then $(x - y)$ is the desired factor, and the algorithm terminates. Otherwise
return to step D3. ∎
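
Steps D1 through D4 can be sketched in Python as follows. The function name `sieve_factor` is hypothetical, the sieve tables are built on the fly rather than assumed given, and the caller is trusted to supply moduli that are pairwise relatively prime and relatively prime to $N$.

```python
from math import isqrt

def sieve_factor(N: int, moduli: tuple[int, ...]) -> int:
    """Largest factor of odd N that is <= sqrt(N), following steps D1-D4."""
    # S[i][j] is True when j^2 - N is congruent to a square modulo moduli[i].
    S = []
    for m in moduli:
        squares = {(y * y) % m for y in range(m)}
        S.append([(j * j - N) % m in squares for j in range(m)])

    x = isqrt(N - 1) + 1                                  # D1: x = ceil(sqrt(N))
    k = [(-x) % m for m in moduli]
    while True:
        if all(S[i][k[i]] for i in range(len(moduli))):   # D2
            y = isqrt(x * x - N)                          # D4
            if y * y == x * x - N:
                return x - y
        x += 1                                            # D3
        k = [(ki - 1) % m for ki, m in zip(k, moduli)]

factor = sieve_factor(8616460799, (3, 5, 7, 8, 11))
```

With Jevons’s number and the moduli of (14), the square-root test in step D4 is attempted only for the few values of $x$ that pass every sieve, and `factor` comes out as 89681.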


There are several ways to make this procedure run fast. For example, we 
have seen that if $N \bmod 3 = 2$, then $x$ must be a multiple of 3; we can set $x = 3x'$,
and use a different sieve corresponding to $x'$, increasing the speed threefold. If
$N \bmod 9 = 1$, 4, or 7, then $x$ must be congruent respectively to $\pm1$, $\pm2$, or $\pm4$
(modulo 9); so we run two sieves (one for $x'$ and one for $x''$, where $x = 9x' + a$
and $x = 9x'' - a$) to increase the speed by a factor of $4\frac12$. If $N \bmod 4 = 3$,
then $x \bmod 4$ is known and the speed is increased by an additional factor of 4;
in the other case, when $N \bmod 4 = 1$, $x$ must be odd so the speed may be
doubled. Another way to double the speed of the algorithm (at the expense of
storage space) is to combine pairs of moduli, using $m_{r-k}m_k$ in place of $m_k$ for
$1 \le k < \frac12 r$.

An even more important method of speeding up Algorithm D is to use 
the Boolean operations found on most binary computers. Let us assume, for 
example, that MIX is a binary computer with 30 bits per word. The tables
$S[i, k_i]$ can be kept in memory with one bit per entry; thus 30 values can be
stored in a single word. The operation AND, which replaces the $k$th bit of the
accumulator by zero if the $k$th bit of a specified word in memory is zero, for
$1 \le k \le 30$, can be used to process 30 values of $x$ at once! For convenience,
we can make several copies of the tables $S[i, j]$ so that the table entries for $m_i$
involve $\mathrm{lcm}(m_i, 30)$ bits; then the sieve tables for each modulus fill an integral
number of words. Under these assumptions, 30 executions of the main loop in
Algorithm D are equivalent to code of the following form:

    D2   LD1   K1       rI1 ← k'_1.
         LDA   S1,1     rA ← S'[1, rI1].
         DEC1  1        rI1 ← rI1 − 1.
         J1NN  *+2
         INC1  M1       If rI1 < 0, set rI1 ← rI1 + lcm(m_1, 30).
         ST1   K1       k'_1 ← rI1.
         LD1   K2       rI1 ← k'_2.
         AND   S2,1     rA ← rA & S'[2, rI1].
         DEC1  1        rI1 ← rI1 − 1.
         J1NN  *+2
         INC1  M2       If rI1 < 0, set rI1 ← rI1 + lcm(m_2, 30).
         ST1   K2       k'_2 ← rI1.
         LD1   K3       rI1 ← k'_3.
         ...            (m_3 through m_r are like m_2)
         ST1   Kr       k'_r ← rI1.
         INCX  30       x ← x + 30.
         JAZ   D2       Repeat if all sieved out. ∎


The number of cycles for 30 iterations is essentially $2 + 8r$; if $r = 11$, this
means three cycles are being used on each iteration, just as in Algorithm C, and
Algorithm C involves $y = \frac12(v - u)$ more iterations.

If the table entries for m; do not come out to be an integral number of 
words, further shifting of the table entries would be necessary on each iteration 
in order to align the bits properly. This would add quite a lot of coding to the 
main loop and it would probably make the program too slow to compete with 
Algorithm C unless v/u < 100 (see exercise 7). 
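
The idea of sieving many values of $x$ with a single AND can be imitated in Python, using arbitrary-precision integers as 30-bit words. This sketch is not a translation of the MIX program: it indexes by $x \bmod m$ instead of $k_i = (-x) \bmod m_i$, and it precomputes one mask per starting offset rather than tables of $\mathrm{lcm}(m_i, 30)$ bits. The names `W`, `block_masks`, and `first_survivor` are hypothetical.

```python
from math import isqrt

W = 30  # candidates per word, matching the 30-bit words assumed in the text

def block_masks(N: int, m: int) -> list[int]:
    """masks[o] has bit b set when a value x with x mod m = (o + b) mod m passes the sieve."""
    squares = {(y * y) % m for y in range(m)}
    ok = [(j * j - N) % m in squares for j in range(m)]
    return [sum(1 << b for b in range(W) if ok[(o + b) % m]) for o in range(m)]

def first_survivor(N: int, moduli: tuple[int, ...]) -> int:
    """Smallest x >= ceil(sqrt(N)) that passes every sieve, found W values at a time."""
    masks = {m: block_masks(N, m) for m in moduli}
    x0 = isqrt(N - 1) + 1
    while True:
        word = (1 << W) - 1
        for m in moduli:
            word &= masks[m][x0 % m]          # one AND sieves W candidates at once
        if word:
            lowest = (word & -word).bit_length() - 1
            return x0 + lowest
        x0 += W

x_first = first_survivor(8616460799, (3, 5, 7, 8, 11))
```

For Jevons’s number the first block of 30 candidates is sieved out entirely, and the second block leaves a single surviving bit, corresponding to $x = 92880$.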

Sieve procedures can be applied to a variety of other problems, not neces- 
sarily having much to do with arithmetic. A survey of these techniques has been 
prepared by Marvin C. Wunderlich, JACM 14 (1967), 10-19. 

F. W. Lawrence proposed the construction of special sieve machines for 
factorization in the 19th century [Quart. J. of Pure and Applied Math. 28 
(1896), 285-311], and E. O. Carissan completed such a device with 14 moduli 
in 1919. [See Shallit, Williams, and Morain, Math. Intelligencer 17,3 (1995), 
41—47, for the interesting story of how Carissan’s long-lost sieve was rediscovered 
and preserved for posterity.] D. H. Lehmer and his associates constructed and 
used many different sieve devices during the period 1926-1989, beginning with 
bicycle chains and later using photoelectric cells and other kinds of technology; 
see, for example, AMM 40 (1933), 401-406. Lehmer’s electronic delay-line sieve, 
which began operating in 1965, processed one million numbers per second. By 
1995 it was possible to construct a machine that sieved 6144 million numbers per 
second, performing 256 iterations of steps D2 and D3 in about 5.2 nanoseconds 
[see Lukes, Patterson, and Williams, Nieuw Archief voor Wiskunde (4) 13 (1995),
113-139]. Another way to factor with sieves was described by D. H. and Emma
Lehmer in Math. Comp. 28 (1974), 625-635. 


Primality testing. None of the algorithms we have discussed so far is an 
efficient way to determine that a large number n is prime. Fortunately, there are 
other methods available for settling this question; efficient techniques have been 
devised by E. Lucas and others, notably D. H. Lehmer [see Bull. Amer. Math. 
Soc. 33 (1927), 327-340]. 

According to Fermat’s theorem (Theorem 1.2.4F), we have 


$$x^{p-1} \bmod p = 1$$

whenever $p$ is prime and $x$ is not a multiple of $p$. Furthermore, there are
efficient ways to calculate $x^{n-1} \bmod n$, requiring only $O(\log n)$ operations of
multiplication mod $n$. (We shall study them in Section 4.6.3 below.) Therefore
we can often determine that n is not prime when this relationship fails. 

For example, Fermat once verified that the numbers $2^1 + 1$, $2^2 + 1$, $2^4 + 1$,
$2^8 + 1$, and $2^{16} + 1$ are prime. In a letter to Mersenne written in 1640, Fermat
conjectured that $2^{2^n} + 1$ is always prime, but said he was unable to determine
definitely whether the number $4294967297 = 2^{32} + 1$ is prime or not. Neither
Fermat nor Mersenne ever resolved this problem, although they could have done
it as follows: The number $3^{2^{32}} \bmod (2^{32} + 1)$ can be computed by doing 32
operations of squaring modulo $2^{32} + 1$, and the answer is 3029026160; therefore
(by Fermat’s own theorem, which he discovered in the same year 1640!) the
number $2^{32} + 1$ is not prime. This argument gives us absolutely no idea what
the factors are, but it answers Fermat’s question. 
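
The 32 squarings are easy to carry out by machine. The following Python sketch performs them explicitly and compares the result with the built-in three-argument `pow`; the variable names are arbitrary.

```python
n = 2**32 + 1

# Thirty-two successive squarings modulo n give 3^(2^32) mod n = 3^(n-1) mod n.
y = 3
for _ in range(32):
    y = (y * y) % n

same_as_builtin = (y == pow(3, n - 1, n))
proves_composite = (y != 1)   # Fermat's theorem would force y = 1 if n were prime
```

Since `y` differs from 1, the number $2^{32} + 1$ fails Fermat’s test for the base 3 and therefore cannot be prime.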

Fermat’s theorem is a powerful test for showing nonprimality of a given 
number. When $n$ is not prime, it is always possible to find a value of $x < n$
such that $x^{n-1} \bmod n \ne 1$; experience shows that, in fact, such a value can
almost always be found very quickly. There are some rare values of $n$ for which
$x^{n-1} \bmod n$ is frequently equal to unity, but then $n$ has a factor less than $\sqrt[3]{n}$;
see exercise 9. 

The same method can be extended to prove that a large prime number n 
really is prime, by using the following idea: If there is a number x for which 
the order of x modulo n is equal to n — 1, then n is prime. (The order of x 
modulo $n$ is the smallest positive integer $k$ such that $x^k \bmod n = 1$; see Section
3.2.1.2.) For this condition implies that the numbers $x^1 \bmod n$, ..., $x^{n-1} \bmod n$
are distinct and relatively prime to $n$, so they must be the numbers 1, 2, ...,
$n - 1$ in some order; thus $n$ has no proper divisors. If $n$ is prime, such a number $x$
(called a primitive root of $n$) will always exist; see exercise 3.2.1.2-16. In fact,
primitive roots are rather numerous. There are $\varphi(n - 1)$ of them, and this is
quite a substantial number, since $n/\varphi(n - 1) = O(\log\log n)$.

It is unnecessary to calculate x* mod n for all k < n — 1 to determine if the 
order of x is n — 1 or not. The order of x will be n — 1 if and only if 
    i) $x^{n-1} \bmod n = 1$;
    ii) $x^{(n-1)/p} \bmod n \ne 1$ for all primes $p$ that divide $n - 1$.

For $x^s \bmod n = 1$ if and only if $s$ is a multiple of the order of $x$ modulo $n$. If the
two conditions hold, and if k is the order of x modulo n, we therefore know that 
k is a divisor of n — 1, but not a divisor of (n — 1)/p for any prime factor p of 
n — 1; the only remaining possibility is k = n — 1. This completes the proof that 
conditions (i) and (ii) suffice to establish the primality of n. 
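
Conditions (i) and (ii) translate almost word for word into code. In this Python sketch the function name `has_order_n_minus_1` is hypothetical, and the caller is assumed to supply the complete list of primes dividing $n - 1$.

```python
def has_order_n_minus_1(x: int, n: int, prime_divisors: list[int]) -> bool:
    """True if x satisfies conditions (i) and (ii), so that n is proved prime.

    prime_divisors must contain every prime that divides n - 1.
    """
    if pow(x, n - 1, n) != 1:                                          # condition (i)
        return False
    return all(pow(x, (n - 1) // p, n) != 1 for p in prime_divisors)   # condition (ii)

# 3 is a primitive root of 7, but 2 is not (2^3 mod 7 = 1).
ok_3 = has_order_n_minus_1(3, 7, [2, 3])
ok_2 = has_order_n_minus_1(2, 7, [2, 3])
```

A return value of `False` says only that this particular $x$ is not a primitive root; it does not by itself show that $n$ is composite unless condition (i) is the one that fails.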

Exercise 10 shows that we can in fact use different values of x for each of 
the primes p, and n will still be prime. We may restrict consideration to prime 
values of $x$, since the order of $uv$ modulo $n$ divides the least common multiple of
the orders of u and v by exercise 3.2.1.2-15. Conditions (i) and (ii) can be tested 
efficiently by using the rapid methods for evaluating powers of numbers discussed 
in Section 4.6.3. But it is necessary to know the prime factors of n—1, so we have 
an interesting situation in which the factorization of $n$ depends on that of $n - 1$.

An example. The study of a reasonably typical large factorization will help
to fix the ideas we have discussed so far. Let us try to find the prime factors
of $2^{214} + 1$, a 65-digit number. The factorization can be initiated with a bit of
clairvoyance if we notice that

$$2^{214} + 1 = (2^{107} - 2^{54} + 1)(2^{107} + 2^{54} + 1); \tag{15}$$

this is a special case of the factorization $4x^4 + 1 = (2x^2 + 2x + 1)(2x^2 - 2x + 1)$,
which Euler communicated to Goldbach in 1742 [P. H. Fuss, Correspondance
Math. et Physique 1 (1843), 145]. The problem now boils down to examining
each of the 33-digit factors in (15). 
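
The starting identity, and the first step of casting out small factors, can be checked with Python’s built-in big integers. The helper name `cast_out_small` and the bound of 1000 are choices made for this sketch.

```python
a = 2**107 - 2**54 + 1
b = 2**107 + 2**54 + 1

identity_holds = (a * b == 2**214 + 1)
digits = len(str(2**214 + 1))

def cast_out_small(n: int, bound: int = 1000) -> tuple[list[int], int]:
    """Divide out all prime factors below bound; return them and the remaining cofactor."""
    small = []
    d = 2
    while d < bound:
        while n % d == 0:
            small.append(d)
            n //= d
        d += 1
    return small, n

small, cofactor = cast_out_small(a)
```

Applied to the first factor, this yields the small primes 5 and 857 and leaves a 29-digit cofactor with no prime factors below 1000.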

A computer program readily discovers that $2^{107} - 2^{54} + 1 = 5 \cdot 857 \cdot n_0$, where

$$n_0 = 37866809061660057264219253397 \tag{16}$$

is a 29-digit number having no prime factors less than 1000. A multiple-precision
calculation using Algorithm 4.6.3A shows that

$$3^{n_0 - 1} \bmod n_0 = 1,$$

so we suspect that $n_0$ is prime. It is certainly out of the question to prove that
$n_0$ is prime by trying the 10 million million or so potential divisors, but the
method discussed above gives a feasible test for primality: Our next goal is to
factor $n_0 - 1$. With little difficulty, our computer will tell us that

$$n_0 - 1 = 2 \cdot 2 \cdot 19 \cdot 107 \cdot 353 \cdot n_1, \qquad n_1 = 13191270754108226049301.$$

Here $3^{n_1 - 1} \bmod n_1 \ne 1$, so $n_1$ is not prime; by continuing Algorithm A or
Algorithm B we obtain another factor,

$$n_1 = 91813 \cdot n_2, \qquad n_2 = 143675413657196977.$$

This time $3^{n_2 - 1} \bmod n_2 = 1$, so we will try to prove that $n_2$ is prime. Casting out
factors $< 1000$ yields $n_2 - 1 = 2 \cdot 2 \cdot 2 \cdot 2 \cdot 3 \cdot 3 \cdot 547 \cdot n_3$, where $n_3 = 1824032775457$.
Since $3^{n_3 - 1} \bmod n_3 \ne 1$, we know that $n_3$ cannot be prime, and Algorithm A
finds that $n_3 = 1103 \cdot n_4$, where $n_4 = 1653701519$. The number $n_4$ behaves like
a prime (that is, $3^{n_4 - 1} \bmod n_4 = 1$), so we calculate

$$n_4 - 1 = 2 \cdot 7 \cdot 19 \cdot 23 \cdot 137 \cdot 1973.$$

Good; this is our first complete factorization. We are now ready to backtrack
to the previous subproblem, proving that $n_4$ is prime. Using the procedure
suggested by exercise 10, we compute the following values:

    x    p       x^((n_4-1)/p) mod n_4    x^(n_4-1) mod n_4
    2    2       1                        (1)
    2    7       766408626                (1)
    2    19      332952683                (1)
    2    23      1154237810               (1)                 (17)
    2    137     373782186                (1)
    2    1973    490790919                (1)
    3    2       1                        (1)
    5    2       1                        (1)
    7    2       1653701518               1

(Here “(1)” means a result of 1 that needn’t be computed since it can be
deduced from previous calculations.) Thus $n_4$ is prime, and $n_2 - 1$ has been
completely factored. A similar calculation shows that $n_2$ is prime, and this
complete factorization of $n_0 - 1$ finally shows, after still another calculation like
(17), that $n_0$ is prime.
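
The search recorded in (17) can be imitated by a short Python loop that, for each prime $p$ dividing $n_4 - 1$, tries small values of $x$ until one satisfies both conditions. The function name `find_witnesses` and the list of candidate bases are assumptions of this sketch, which follows the idea attributed to exercise 10 (a different $x$ may be used for each $p$).

```python
n4 = 1653701519
primes = [2, 7, 19, 23, 137, 1973]      # the prime factors of n4 - 1

def find_witnesses(n: int, primes: list[int],
                   candidates: tuple[int, ...] = (2, 3, 5, 7, 11, 13)) -> dict[int, int]:
    """For each prime p, the first candidate x with x^(n-1) = 1 and x^((n-1)/p) != 1 (mod n)."""
    witnesses = {}
    for p in primes:
        for x in candidates:
            if pow(x, n - 1, n) == 1 and pow(x, (n - 1) // p, n) != 1:
                witnesses[p] = x
                break
    return witnesses

witnesses = find_witnesses(n4, primes)
proved_prime = all(p in witnesses for p in primes)
```

As in the table, $x = 2$ serves for every odd $p$, while the case $p = 2$ needs $x = 7$ because 2, 3, and 5 all give the value 1.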

The last three lines of (17) represent a search for an integer x that satisfies 
$x^{(n_4-1)/2} \not\equiv x^{n_4-1} \equiv 1$ (modulo $n_4$). If $n_4$ is prime, we have only a 50-50 chance
of success, so the case $p = 2$ is typically the hardest one to verify. We could
streamline this part of the calculation by using the law of quadratic reciprocity
(see exercise 23), which tells us for example that $5^{(q-1)/2} \equiv 1$ (modulo $q$)
whenever $q$ is a prime congruent to $\pm1$ (modulo 5). Merely calculating $n_4 \bmod 5$
would have told us right away that $x = 5$ could not possibly help in showing
that $n_4$ is prime. In fact, however, the result of exercise 26 implies that the case
p = 2 doesn’t really need to be considered at all when testing n for primality, 
unless n — 1 is divisible by a high power of 2, so we could have dispensed with 
the last three lines of (17) entirely. 

The next quantity to be factored is the other half of (15), namely 


$$n_5 = 2^{107} + 2^{54} + 1.$$

Since $3^{n_5 - 1} \bmod n_5 \ne 1$, we know that $n_5$ is not prime, and Algorithm B shows
that $n_5 = 843589 \cdot n_6$, where $n_6 = 192343993140277293096491917$. Unfortunately,
$3^{n_6 - 1} \bmod n_6 \ne 1$, so we are left with a 27-digit nonprime. Continuing
Algorithm B might well exhaust our patience (not our budget; we’re using
idle time on a weekend rather than “prime time”). But the sieve method of
Algorithm D will be able to crack $n_6$ into its two factors,

$$n_6 = 8174912477117 \cdot 23528569104401.$$

(It turns out that Algorithm B would also have succeeded, after 6,432,966 iterations.)
The factors of $n_6$ could not have been discovered by Algorithm A in a
reasonable length of time.

Now the computation is complete: $2^{214} + 1$ has the prime factorization

$$5 \cdot 857 \cdot 843589 \cdot 8174912477117 \cdot 23528569104401 \cdot n_0,$$

where $n_0$ is the 29-digit prime in (16). A certain amount of good fortune entered
into these calculations, for if we had not started with the known factorization
(15) it is quite probable that we would first have cast out the small factors,
reducing $n$ to $n_6 n_0$. This 55-digit number would have been much more difficult
to factor; Algorithm D would be useless and Algorithm B would have to work
overtime because of the high precision necessary. 
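
The final result is easy to confirm with exact integer arithmetic. The following Python sketch multiplies the six stated factors together and also applies the base-3 Fermat test to each; passing that test is consistent with primality but, as discussed above, does not prove it.

```python
n0 = 37866809061660057264219253397
factors = [5, 857, 843589, 8174912477117, 23528569104401, n0]

product = 1
for f in factors:
    product *= f

product_matches = (product == 2**214 + 1)

# Each factor should satisfy 3^(f-1) mod f = 1 if it is prime.
fermat_ok = all(pow(3, f - 1, f) == 1 for f in factors)

# The 55-digit number that would remain after casting out only the small factors.
residue_digits = len(str(8174912477117 * 23528569104401 * n0))
```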

Dozens of further numerical examples can be found in an article by John 
Brillhart and J. L. Selfridge, Math. Comp. 21 (1967), 87-96. 


Improved primality tests. The procedure just illustrated requires the com- 
plete factorization of n—1 before we can prove that n is prime, so it will bog down 
for large n. Another technique, which uses the factorization of n + 1 instead, is 
described in exercise 15; if n — 1 turns out to be too hard, n+ 1 might be easier. 

Significant improvements are available for dealing with large n. For example, 
it is not difficult to prove a stronger converse of Fermat’s theorem that requires 
only a partial factorization of n — 1. Exercise 26 shows that we could have 
avoided most of the calculations in (17); the three conditions $2^{n_4-1} \bmod n_4 = \gcd(2^{(n_4-1)/23} - 1, n_4) = \gcd(2^{(n_4-1)/1973} - 1, n_4) = 1$
are sufficient by themselves to prove that $n_4$ is prime. Brillhart, Lehmer, and Selfridge have in fact
developed a method that works when the numbers $n - 1$ and $n + 1$ have been
only partially factored [Math. Comp. 29 (1975), 620-647, Corollary 11]: Suppose
$n - 1 = f^- r^-$ and $n + 1 = f^+ r^+$, where we know the complete factorizations
of $f^-$ and $f^+$, and we also know that all factors of $r^-$ and $r^+$ are $\ge b$. If the
product $(b^3 f^- f^+ \max(f^-, f^+))$ is greater than $2n$, a small amount of additional
computation, described in their paper, will determine whether or not $n$ is prime.
Therefore numbers of up to 35 digits can usually be tested for primality in a
fraction of a second, simply by casting out all prime factors $< 30030$ from $n \pm 1$
[see J. L. Selfridge and M. C. Wunderlich, Congressus Numerantium 12 (1974),
109-120]. The partial factorization of other quantities like $n^2 \pm n + 1$ and $n^2 + 1$
can be used to improve this method still further [see H. C. Williams and J. S. 
Judd, Math. Comp. 30 (1976), 157-172, 867-886]. 

In practice, when n has no small prime factors and 3^{n-1} mod n = 1, further
calculations almost always show that n is prime. (One of the rare exceptions in
the author’s experience is n = (1/7)(2^28 - 9) = 2341 · 16381.) On the other hand,
some nonprime values of n are definitely bad news for the primality test we
have discussed, because it might happen that x^{n-1} mod n = 1 for all x relatively
prime to n (see exercise 9). The smallest such number is n = 3 · 11 · 17 = 561; here
λ(n) = lcm(2, 10, 16) = 80 in the notation of Eq. 3.2.1.2-(9), so x^80 mod 561 =
1 = x^560 mod 561 whenever x is relatively prime to 561. Our procedure would
repeatedly fail to show that such an n is nonprime, until we had stumbled across
one of its divisors. To improve the method, we need a quick way to determine
the nonprimality of nonprime n, even in such pathological cases.

The following surprisingly simple procedure is guaranteed to do the job with
high probability:

Algorithm P (Probabilistic primality test). Given an odd integer n, this
algorithm attempts to decide whether or not n is prime. By repeating the algorithm
several times, as explained in the remarks below, it is possible to be extremely
confident about the primality of n, in a precise sense, yet the primality will not
be rigorously proved. Let n = 1 + 2^k q, where q is odd.
P1. [Generate x.] Let x be a random integer in the range 1 < x < n.
P2. [Exponentiate.] Set j ← 0 and y ← x^q mod n. (As in our previous primality
test, x^q mod n should be calculated in O(log q) steps; see Section 4.6.3.)
P3. [Done?] (Now y = x^{2^j q} mod n.) If y = n - 1, or if y = 1 and j = 0, terminate
the algorithm and say “n is probably prime.” If y = 1 and j > 0, go to P5.
P4. [Increase j.] Increase j by 1. If j < k, set y ← y^2 mod n and return to P3.
P5. [Not prime.] Terminate and say “n is definitely not prime.” ∎


The idea underlying Algorithm P is that if x^q mod n ≠ 1 and n = 1 + 2^k q is
prime, the sequence of values


    x^q mod n,  x^{2q} mod n,  x^{4q} mod n,  ...,  x^{2^k q} mod n


will end with 1, and the value just preceding the first appearance of 1 will be
n - 1. (The only solutions to y^2 ≡ 1 (modulo p) are y ≡ ±1, when p is prime,
since (y - 1)(y + 1) must be a multiple of p.)

Exercise 22 proves the basic fact that Algorithm P will be wrong at most 1/4 
of the time, for all n. Actually it will rarely fail at all, for most n; but the crucial 
point is that the probability of failure is bounded regardless of the value of n. 

Suppose we invoke Algorithm P repeatedly, choosing x independently and
at random whenever we get to step P1. If the algorithm ever reports that n is
nonprime, we can be sure this is so. But if the algorithm reports 25 times in a row
that n is “probably prime,” we can say that n is “almost surely prime.” For the
probability is less than (1/4)^25 that such a 25-times-in-a-row procedure gives the
wrong information about its input. This is less than one chance in a quadrillion;
even if we tested a billion different numbers with such a procedure, the expected
number of mistakes would be less than 1/1000000. It’s much more likely that our
computer has dropped a bit in its calculations, due to hardware malfunctions or
cosmic radiations, than that Algorithm P has repeatedly guessed wrong!
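
Steps P1-P5 and the 25-times-in-a-row procedure can be sketched in Python as follows. This is only an illustrative transcription, not a tuned implementation: the function names, the default of 25 trials, and the fixed random seed (used so that runs are reproducible) are assumptions of the sketch, and Python's built-in three-argument pow stands in for the exponentiation method of Section 4.6.3.

```python
import random

def algorithm_p_trial(n, x):
    """One pass of Algorithm P for odd n >= 5 and a given x with 1 < x < n.

    Returns True for "probably prime", False for "definitely not prime".
    """
    # Write n = 1 + 2^k q with q odd.
    k, q = 0, n - 1
    while q % 2 == 0:
        q //= 2
        k += 1
    y = pow(x, q, n)              # P2: y = x^q mod n
    j = 0
    while True:
        # P3: at this point y = x^(2^j q) mod n
        if y == n - 1 or (y == 1 and j == 0):
            return True
        if y == 1 and j > 0:
            return False          # the previous y was a square root of 1 other than +-1
        j += 1                    # P4
        if j >= k:
            return False          # P5
        y = y * y % n

def almost_surely_prime(n, trials=25, rng=None):
    """Repeat Algorithm P with independent random x (step P1)."""
    rng = rng or random.Random(0)     # fixed seed: an assumption, for reproducibility
    return all(algorithm_p_trial(n, rng.randrange(2, n)) for _ in range(trials))
```

For instance, the troublesome number 561 is exposed already by x = 2: the sequence of y values is 263, 166, 67, 1, so a 1 appears without being preceded by 560, and algorithm_p_trial(561, 2) returns False.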

Probabilistic algorithms like this lead us to question our traditional
standards of reliability. Do we really need to have a rigorous proof of primality?
For people unwilling to abandon traditional notions of proof, Gary L. Miller has
demonstrated (in slightly weaker form) that if a certain well-known conjecture
in number theory called the Extended Riemann Hypothesis can be proved, then
either n is prime or there is an x < 2(ln n)^2 such that Algorithm P will discover
the nonprimality of n. [See J. Comp. System Sci. 13 (1976), 300-317. The
constant 2 in this upper bound is due to Eric Bach, Math. Comp. 55 (1990),
355-380. See Chapter 8 of Algorithmic Number Theory 1 by E. Bach and J. O.
Shallit (MIT Press, 1996), for an exposition of various generalizations of the
Riemann hypothesis.] Thus, we would have a rigorous way to test primality in
O(log n)^5 elementary operations, as opposed to a probabilistic method whose
running time is O(log n)^3, if the Extended Riemann Hypothesis were proved.
But one might well ask whether any purported proof of that hypothesis will ever
be as reliable as repeated application of Algorithm P on random x’s.

A probabilistic test for primality was proposed in 1974 by R. Solovay and 
V. Strassen, who devised the interesting but more complicated test described 
in exercise 23(b). [See SICOMP 6 (1977), 84-85; 7 (1978), 118.] Algorithm P 
is a simplified version of a procedure due to M. O. Rabin, based in part on 
ideas of Gary L. Miller [see Algorithms and Complexity (1976), 35-36], and 
independently discovered by J. L. Selfridge. B. Arazi [Comp. J. 37 (1994),
219-222] has observed that Algorithm P can be speeded up significantly for large n
by using Montgomery’s fast method for remainders (exercise 4.3.1-41). 

A completely rigorous and deterministic way to test for primality in poly- 
nomial time was finally discovered in 2002 by Manindra Agrawal, Neeraj Kayal, 
and Nitin Saxena, who proved the following result: 


Theorem A. Let r be an integer such that n ⊥ r and the order of n modulo r
exceeds (lg n)^2. Then n is prime if and only if the polynomial congruence


    (z + a)^n ≡ z^n + a   (modulo z^r - 1 and n)


holds for 0 ≤ a ≤ √r lg n. (See exercise 3.2.2-11(a).) ∎

An excellent exposition of this theorem has been prepared by Andrew
Granville [Bull. Amer. Math. Soc. 42 (2005), 3-38], who presents an elementary proof
that it yields a primality test with running time Ω(log n)^6 and O(log n)^11. He
also explains a subsequent improvement due to H. Lenstra and C. Pomerance,
who showed that the running time can be reduced to O(log n)^{6+ε} if the
polynomial z^r - 1 is replaced by a more general family of polynomials. And he
discusses refinements by P. Berrizbeitia, Q. Cheng, P. Mihailescu, R. Avanzi, and
D. Bernstein, leading to a probabilistic algorithm by which a proof of primality
can almost surely be found in O(log n)^{4+ε} steps whenever n is prime.


Factoring via continued fractions. The factorization procedures we have 
discussed so far will often balk at numbers of 30 digits or more, and another 
idea is needed if we are to go much further. Fortunately there is such an idea; in 
fact, there were two ideas, due respectively to A. M. Legendre and M. Kraitchik, 
which led D. H. Lehmer and R. E. Powers to devise a new technique many years 
ago [Bull. Amer. Math. Soc. 37 (1931), 770-776]. However, the method was not 
used at the time because it was comparatively unsuitable for desk calculators. 
This negative judgment prevailed until the late 1960s, when John Brillhart found 
that the Lehmer—Powers approach deserved to be resurrected, since it was quite 
well suited to computer programming. In fact, he and Michael A. Morrison later 
developed it into the champion of all multiprecision factorization methods that 
were known in the 1970s. Their program would handle typical 25-digit numbers
in about 30 seconds, and 40-digit numbers in about 50 minutes, on an IBM
360/91 computer [see Math. Comp. 29 (1975), 183-205]. The method had its
first triumphant success in 1970, discovering that 2^128 + 1 =
59649589127497217 · 5704689200685129054721.

The basic idea is to search for numbers x and y such that


    x^2 ≡ y^2 (modulo N),   0 < x, y < N,   x ≠ y,   x + y ≠ N.   (18)


Fermat’s method imposes the stronger requirement x^2 - y^2 = N, but actually
the congruence (18) is enough to split N into factors: It implies that N is a
divisor of x^2 - y^2 = (x - y)(x + y), yet N divides neither x - y nor x + y; hence
gcd(N, x - y) and gcd(N, x + y) are proper factors of N that can be found by
the efficient methods of Section 4.5.2.

One way to discover solutions of (18) is to look for values of x such that
x^2 ≡ a (modulo N), for small values of |a|. As we will see, it is often a simple
matter to piece together solutions of this congruence to obtain solutions of (18).
Now if x^2 = a + kNd^2 for some k and d, with small |a|, the fraction x/d is a good
approximation to √(kN); conversely, if x/d is an especially good approximation
to √(kN), the difference |x^2 - kNd^2| will be small. This observation suggests
looking at the continued fraction expansion of √(kN), since we have seen in
Eq. 4.5.3-(12) and exercise 4.5.3-42 that continued fractions yield good rational
approximations.

Continued fractions for quadratic irrationalities have many pleasant
properties, which are proved in exercise 4.5.3-12. The algorithm below makes use of
these properties to derive solutions to the congruence


    x^2 ≡ (-1)^{e_0} p_1^{e_1} p_2^{e_2} ... p_m^{e_m}   (modulo N).   (19)


Here we use a fixed set of small primes p_1 = 2, p_2 = 3, ..., up to p_m; only
primes p such that either p = 2 or (kN)^{(p-1)/2} mod p ≤ 1 should appear in this
list, since other primes will never be factors of the numbers generated by the
algorithm (see exercise 14). If (x_1, e_01, e_11, ..., e_m1), ..., (x_r, e_0r, e_1r, ..., e_mr)
are solutions of (19) such that the vector sum

    (e_01, e_11, ..., e_m1) + ... + (e_0r, e_1r, ..., e_mr) = (2e'_0, 2e'_1, ..., 2e'_m)   (20)


is even in each component, then


    x = (x_1 ... x_r) mod N,   y = ((-1)^{e'_0} p_1^{e'_1} ... p_m^{e'_m}) mod N   (21)


yields a solution to (18), except for the possibility that x ≡ ±y. Condition (20)
essentially says that the vectors are linearly dependent modulo 2, so we must
have a solution to (20) if we have found at least m + 2 solutions to (19).


Algorithm E (Factoring via continued fractions). Given a positive integer N
and a positive integer k such that kN is not a perfect square, this algorithm
attempts to discover solutions to the congruence (19) for a given sequence of
primes p_1, ..., p_m, by analyzing the convergents of the continued fraction for
√(kN). (Another algorithm, which uses the outputs to discover factors of N, is
the subject of exercise 12.)

Table 1


AN ILLUSTRATION OF ALGORITHM E
N = 197209, k = 1, m = 3, p_1 = 2, p_2 = 3, p_3 = 5


              U     V     A       P     S     T    Output
After E1:    876    73    12     5329   1     -
After E4:    882   145     6     5329   0    29
After E4:    857    37    23    32418   1    37
After E4:    751   720     1   159316   0     1    159316^2 ≡ +2^4 · 3^2 · 5^1
After E4:    852   143     5   191734   1   143
After E4:    681   215     3   131941   0    43
After E4:    863   656     1   193139   1    41
After E4:    883    33    26   127871   0    11
After E4:    821   136     6   165232   1    17
After E4:    877   405     2   133218   0     1    133218^2 ≡ +2^0 · 3^4 · 5^1
After E4:    875    24    36    37250   1     1    37250^2 ≡ -2^3 · 3^1 · 5^0
After E4:    490   477     1    93755   0    53


E1. [Initialize.] Set D ← kN, R ← ⌊√D⌋, R' ← 2R, U' ← R', V ← D - R^2,
V' ← 1, A ← ⌊R'/V⌋, U ← R' - (R' mod V), P' ← R, P ← (AR + 1) mod N,
S ← 1. (This algorithm follows the general procedure of exercise 4.5.3-12,
finding the continued fraction expansion of √(kN). The variables U, U', V, V',
P, P', A, and S represent, respectively, what that exercise calls ⌊√D⌋ + U_n,
⌊√D⌋ + U_{n-1}, V_n, V_{n-1}, p_n mod N, p_{n-1} mod N, A_n, and n mod 2, where
n is initially 1. We will always have 0 < V ≤ U ≤ R', so the highest
precision is needed only for P and P'.)
E2. [Advance U, V, S.] Set T ← V, V ← A(U' - U) + V', V' ← T, A ← ⌊U/V⌋,
U' ← U, U ← R' - (U mod V), S ← 1 - S.
E3. [Factor V.] (Now we have P^2 - kNQ^2 = (-1)^S V, for some Q relatively prime
to P, by exercise 4.5.3-12(c).) Set (e_0, e_1, ..., e_m) ← (S, 0, ..., 0), T ← V.
Now do the following, for 1 ≤ j ≤ m: If T mod p_j = 0, set T ← T/p_j and
e_j ← e_j + 1, and repeat this process until T mod p_j ≠ 0.
E4. [Solution?] If T = 1, output the values (P, e_0, e_1, ..., e_m), which comprise a
solution to (19). (If enough solutions have been generated, we may terminate
the algorithm now.)
E5. [Advance P, P'.] If V ≠ 1, set T ← P, P ← (AP + P') mod N, P' ← T,
and return to step E2. Otherwise the continued fraction process has started
to repeat its cycle, except perhaps for S, so the algorithm terminates. (The
cycle will usually be so long that this doesn’t happen.) ∎


We can illustrate the application of Algorithm E to relatively small numbers
by considering the case N = 197209, k = 1, m = 3, p_1 = 2, p_2 = 3, p_3 = 5. The
computation begins as shown in Table 1.
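
Steps E1-E5 can be transcribed almost literally into Python; the sketch below does so and reproduces the three outputs of Table 1. The function name, the iteration cap, and the shape of the returned list are assumptions made for illustration, and Python's unbounded integers sidestep the precision concerns mentioned in step E1.

```python
from math import isqrt

def algorithm_e(N, k, primes, max_iterations):
    """Return the outputs (P, (e_0, e_1, ..., e_m)) of Algorithm E."""
    # E1. Initialize.
    D = k * N
    R = isqrt(D)
    R2 = 2 * R                      # R'
    U_prev = R2                     # U'
    V = D - R * R
    V_prev = 1                      # V'
    A = R2 // V
    U = R2 - (R2 % V)
    P_prev = R                      # P'
    P = (A * R + 1) % N
    S = 1
    outputs = []
    for _ in range(max_iterations):
        # E2. Advance U, V, S (right-hand sides use the old values).
        V, V_prev = A * (U_prev - U) + V_prev, V
        A = U // V
        U_prev, U = U, R2 - (U % V)
        S = 1 - S
        # E3. Factor V over the given primes.
        e = [S] + [0] * len(primes)
        T = V
        for j, p in enumerate(primes, start=1):
            while T % p == 0:
                T //= p
                e[j] += 1
        # E4. Solution?
        if T == 1:
            outputs.append((P, tuple(e)))
        # E5. Advance P, P'.
        if V == 1:
            break                   # the continued fraction has started to repeat
        P, P_prev = (A * P + P_prev) % N, P
    return outputs

print(algorithm_e(197209, 1, [2, 3, 5], 11))
# [(159316, (0, 4, 2, 1)), (133218, (0, 0, 4, 1)), (37250, (1, 3, 1, 0))]
```

The eleven iterations correspond to the eleven "After E4" rows of Table 1.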

Continuing the computation gives 25 outputs in the first 100 iterations; in
other words, the algorithm is finding solutions quite rapidly. But some of the
solutions are trivial. For example, if the computation above were continued 14
more times, we would obtain the output 197197^2 ≡ 2^4 · 3^2 · 5^0, which is of no
interest since 197197 ≡ -12. The first two solutions above are already enough
to complete the factorization: We have found that


    (159316 · 133218)^2 ≡ (2^2 · 3^3 · 5^1)^2   (modulo 197209);


thus (18) holds with x = (159316 · 133218) mod 197209 = 126308, y = 540. By
Euclid’s algorithm, gcd(126308 - 540, 197209) = 199; hence we obtain the pretty
factorization

    197209 = 199 · 991.
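
The arithmetic of this example follows equations (20) and (21) directly, as the short Python check below illustrates. The helper name combine and the hard-coded list of the first two outputs are assumptions of the sketch; a general program would first have to select a subset of outputs whose exponent vectors sum to an even vector (the subject of exercise 12).

```python
from math import gcd

def combine(N, primes, outputs):
    """Form x and y as in (21) from outputs whose exponent sums are all even."""
    totals = [sum(column) for column in zip(*(e for _, e in outputs))]
    assert all(t % 2 == 0 for t in totals), "condition (20) is not satisfied"
    x = 1
    for P, _ in outputs:
        x = x * P % N
    y = -1 if (totals[0] // 2) % 2 else 1       # the factor (-1)^{e'_0}
    for p, t in zip(primes, totals[1:]):
        y = y * pow(p, t // 2, N) % N
    return x, y % N

N = 197209
x, y = combine(N, [2, 3, 5], [(159316, (0, 4, 2, 1)), (133218, (0, 0, 4, 1))])
print(x, y, gcd(x - y, N))   # 126308 540 199
```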


We can get some understanding of why Algorithm E factors large numbers
so successfully by considering a heuristic analysis of its running time, following
unpublished ideas that R. Schroeppel communicated to the author in 1975. Let
us assume for convenience that k = 1. The number of outputs needed to produce
a factorization of N will be roughly proportional to the number m of small primes
being cast out. Each execution of step E3 takes about order m log N units of
time, so the total running time will be roughly proportional to m^2 log N/P,
where P is the probability of a successful output per iteration. If we make the
conservative assumption that V is randomly distributed between 0 and 2√N, the
probability P is (2√N)^{-1} times the number of integers < 2√N whose prime
factors are all in the set {p_1, ..., p_m}. Exercise 29 gives a lower bound for P,
from which we conclude that the running time is at most of order


    (2√N m^2 log N) / (m^r / r!),   where r = ⌊log(2√N) / log p_m⌋.   (22)


If we let ln m be approximately (1/2)√(ln N ln ln N), we have r ≈ √(ln N/ln ln N) - 1,
assuming that p_m = O(m log m), so formula (22) reduces to


    exp(2√((ln N)(ln ln N)) + O((log N)^{1/2} (log log N)^{-1/2} (log log log N))).


Stating this another way, the running time of Algorithm E is expected to be at
most N^{ε(N)} under reasonably plausible assumptions, where the exponent ε(N) ≈
2√(ln ln N/ln N) goes to 0 as N → ∞.

When N is in a practical range, we should of course be careful not to take
such asymptotic estimates too seriously. For example, if N = 10^50 we have
N^{1/α} = (lg N)^α when α ≈ 4.75, and the same relation holds for α ≈ 8.42
when N = 10^200. The function N^{ε(N)} has an order of growth that is sort of
a cross between N^{1/α} and (lg N)^α; but all three of these forms are about the
same, unless N is intolerably large. Extensive computational experiments by 
M. C. Wunderlich have shown that a well-tuned version of Algorithm E performs 
much better than our estimate would indicate [see Lecture Notes in Math. 751 
(1979), 328-342]; although 2√(ln ln N/ln N) ≈ .41 when N = 10^50, he obtained
running times of about N^{0.15} while factoring thousands of numbers in the range
10^13 ≤ N ≤ 10^42.
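
The numerical values quoted for α and for the exponent ε(N) can be checked with a few lines of Python. The function names are assumptions of this sketch; alpha solves N^{1/α} = (lg N)^α, that is, α^2 = ln N / ln lg N, and epsilon evaluates 2√(ln ln N/ln N).

```python
import math

def alpha(N):
    """Solve N**(1/a) == (lg N)**a for a, i.e. a*a = ln N / ln(lg N)."""
    return math.sqrt(math.log(N) / math.log(math.log2(N)))

def epsilon(N):
    """The heuristic exponent 2*sqrt(ln ln N / ln N) for Algorithm E."""
    return 2 * math.sqrt(math.log(math.log(N)) / math.log(N))

print(round(alpha(10**50), 2), round(alpha(10**200), 2), round(epsilon(10**50), 2))
# 4.75 8.42 0.41
```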

Algorithm E begins its attempt to factorize N by essentially replacing N
by kN, and this is a rather curious way to proceed (if not downright stupid).
“Excuse me, do you mind if I multiply your number by 3 before I try to factor
it?” Nevertheless, it turns out to be a good idea, since certain values of k will
make the V numbers potentially divisible by more small primes, hence they will
be more likely to factor completely in step E3. On the other hand, a large value
of k will make the V numbers larger, hence they will be less likely to factor
completely; we want to balance these tendencies by choosing k wisely. Consider,
for example, the divisibility of V by powers of 5. We have P^2 - kNQ^2 = (-1)^S V
in step E3, so if 5 divides V we have P^2 ≡ kNQ^2 (modulo 5). In this congruence
Q cannot be a multiple of 5, since it is relatively prime to P, so we may write
(P/Q)^2 ≡ kN (modulo 5). If we assume that P and Q are random relatively
prime integers, so that the 24 possible pairs (P mod 5, Q mod 5) ≠ (0, 0) are
equally likely, the probability that 5 divides V is therefore 4/24, 8/24, 0, 0, or 8/24
according as kN mod 5 is 0, 1, 2, 3, or 4. Similarly the probability that 25 divides
V is 0, 40/600, 0, 0, 40/600 respectively, unless kN is a multiple of 25. In general, given
an odd prime p with (kN)^{(p-1)/2} mod p = 1, we find that V is a multiple of p^e
with probability 2/(p^{e-1}(p + 1)); and the average number of times p divides V
comes to 2p/(p^2 - 1). This analysis, suggested by R. Schroeppel, suggests that
the best choice of k is the value that maximizes


    Σ_{j=1}^{m} f(p_j, kN) log p_j - (1/2) log k,   (23)


where f is the function defined in exercise 28, since this is essentially the expected
value of ln(√N/T) when we reach step E4.
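
The divisibility probabilities in this argument can be confirmed by brute-force counting under the same simplified model: all pairs (P mod p^e, Q mod p^e) that are not both multiples of p are taken to be equally likely. The Python sketch below is only a check of that model; the function name and the use of exact fractions are assumptions made for illustration.

```python
from fractions import Fraction

def prob_divides(p, e, c):
    """Fraction of pairs (P, Q) mod p**e, not both multiples of p,
    for which p**e divides P*P - c*Q*Q (c plays the role of kN)."""
    M = p ** e
    good = total = 0
    for P in range(M):
        for Q in range(M):
            if P % p == 0 and Q % p == 0:
                continue            # excluded: P and Q are relatively prime
            total += 1
            if (P * P - c * Q * Q) % M == 0:
                good += 1
    return Fraction(good, total)

print([prob_divides(5, 1, c) for c in range(5)])   # 4/24, 8/24, 0, 0, 8/24 in lowest terms
print(prob_divides(5, 2, 1))                       # 40/600 = 1/15
```

The general formula 2/(p^{e-1}(p + 1)) gives 2/6 = 8/24 for p = 5, e = 1 and 2/30 = 40/600 for e = 2, in agreement with the counts.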

Best results will be obtained with Algorithm E when both k and m are well 
chosen. The proper choice of m can only be made by experimental testing, since 
the asymptotic analysis we have made is too crude to give sufficiently precise 
information, and since a variety of refinements to the algorithm tend to have 
unpredictable effects. For example, we can make an important improvement by 
comparing step E3 with Algorithm A: The factoring of V can stop whenever we 
find T mod p_j ≠ 0 and ⌊T/p_j⌋ ≤ p_j, since T will then be either 1 or prime. If T is
a prime greater than p_m (it will be at most p_m^2 + p_m - 1 in such a case), we can still
output (P, e_0, ..., e_m, T), since a complete factorization has been obtained. The
second phase of the algorithm will use only those outputs whose prime T’s have 
occurred at least twice. This modification gives the effect of a much longer list 
of primes, without increasing the factorization time. Wunderlich’s experiments 
indicate that m ≈ 150 works well in the presence of this refinement, when N is
in the neighborhood of 10^40.

Since step E3 is by far the most time-consuming part of the algorithm,
Morrison, Brillhart, and Schroeppel have suggested several ways to abort this
step when success becomes improbable: (a) Whenever T changes to a
single-precision value, continue only if ⌊T/p_j⌋ > p_j and 3^{T-1} mod T ≠ 1. (b) Give
up if T is still > p_m^2 after casting out factors < (1/10)p_m. (c) Cast out factors
only up to p_5, say, for batches of 100 or so consecutive V’s; continue the
factorization later, but only on the V from each batch that has produced the
smallest residual T. (Before casting out the factors up to p_5, it is wise to calculate
V mod p_1^{f_1} p_2^{f_2} p_3^{f_3} p_4^{f_4} p_5^{f_5}, where the f’s are small enough to make
p_1^{f_1} p_2^{f_2} p_3^{f_3} p_4^{f_4} p_5^{f_5} fit in single precision, but large enough to make
V mod p_i^{f_i+1} = 0 unlikely. One
single-precision remainder will therefore characterize the value of V modulo five
small primes.)

For estimates of the cycle length in the output of Algorithm E, see H. C. 
Williams, Math. Comp. 36 (1981), 593-601. 


*A theoretical upper bound. From the standpoint of computational complexity,
we would like to know if there is any method of factorization whose expected
running time can be proved to be O(N^{ε(N)}), where ε(N) → 0 as N → ∞. We
have seen that Algorithm E probably has such behavior, but it seems hopeless 
to find a rigorous proof, because continued fractions are not sufficiently well 
disciplined. The first proof that a good factorization algorithm exists in this 
sense was discovered by John Dixon in 1978; Dixon showed, in fact, that it 
suffices to consider a simplified version of Algorithm E, in which the continued 
fraction apparatus is removed but the basic idea of (18) remains. 

Dixon’s method [Math. Comp. 36 (1981), 255-260] is simply this, assuming
that N is known to have at least two distinct prime factors, and that N is not
divisible by the first m primes p_1, p_2, ..., p_m: Choose a random integer X in
the range 0 < X < N, and let V = X^2 mod N. If V = 0, the number gcd(X, N)
is a proper factor of N. Otherwise cast out all of the small prime factors of V as
in step E3; in other words, express V in the form


    V = p_1^{e_1} ... p_m^{e_m} T,   (24)

where T is not divisible by any of the first m primes. If T = 1, the algorithm
proceeds as in step E4 to output (X, e_1, ..., e_m), which represents a solution
to (19) with e_0 = 0. This process continues with new random values of X until
there are sufficiently many outputs to discover a factor of N by the method of
exercise 12.
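
A compact Python sketch of this random-squares method is given below. It is illustrative only: the function names, the default m = 8, and the fixed seed are assumptions, and a simple incremental elimination modulo 2 on bitmasks stands in for the method of exercise 12, which is not reproduced here. The sketch also tests gcd(X, N) > 1 directly, a slightly stronger shortcut than the V = 0 test in the text. It loops until a factor is found, so N should have at least two distinct prime factors.

```python
import math
import random

def first_primes(m):
    ps, c = [], 2
    while len(ps) < m:
        if all(c % p for p in ps):
            ps.append(c)
        c += 1
    return ps

def dixon_factor(N, m=8, seed=1):
    """Return a proper factor of N (assumed to have two distinct prime factors)."""
    rng = random.Random(seed)
    primes = first_primes(m)
    for p in primes:
        if N % p == 0:
            return p
    relations = []        # outputs (X, exponent vector)
    basis = {}            # pivot bit -> (parity mask, bitmask of relations used)
    while True:
        X = rng.randrange(1, N)
        g = math.gcd(X, N)
        if g > 1:
            return g
        T, e = X * X % N, [0] * m
        for j, p in enumerate(primes):          # cast out small primes, as in step E3
            while T % p == 0:
                T //= p
                e[j] += 1
        if T != 1:
            continue                            # V is not a product of small primes
        relations.append((X, e))
        mask = sum((e[j] & 1) << j for j in range(m))
        combo = 1 << (len(relations) - 1)
        while mask:                             # reduce the parity vector modulo 2
            pivot = mask.bit_length() - 1
            if pivot not in basis:
                basis[pivot] = (mask, combo)
                break
            mask ^= basis[pivot][0]
            combo ^= basis[pivot][1]
        else:                                   # dependency: all exponent sums are even
            x, total = 1, [0] * m
            for i, (Xi, ei) in enumerate(relations):
                if combo >> i & 1:
                    x = x * Xi % N
                    total = [a + b for a, b in zip(total, ei)]
            y = 1
            for p, t in zip(primes, total):
                y = y * pow(p, t // 2, N) % N
            g = math.gcd(x - y, N)
            if 1 < g < N:
                return g

print(dixon_factor(197209))    # one of the two prime factors, 199 or 991
```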

In order to analyze this algorithm, we want to find bounds on (a) the 
probability that a random X will yield an output, and (b) the probability that a 
large number of outputs will be required before a factor is found. Let P(m, N) 
be the probability (a), namely the probability that T = 1 when X is chosen at 
random. After M values of X have been tried, we will obtain M P(m, N) outputs, 
on the average; and the number of outputs has a binomial distribution, so the 
standard deviation is less than the square root of the mean. The probability (b) 
is fairly easy to deal with, since exercise 13 proves that the algorithm needs more 
than m + k outputs with probability ≤ 2^{-k}.

Exercise 30 proves that P(m, N) ≥ m^r/(r! N) when r = 2⌊log N/(2 log p_m)⌋,
so we can estimate the running time almost as we did in (22) but with the
quantity 2√N replaced by N. This time we choose


    r = √(2 ln N/ln ln N) + θ,


where |θ| ≤ 1 and r is even, and we choose m so that

    r = ln N/ln p_m + O(1/log log N);

this means


    ln p_m = √((ln N ln ln N)/2) - (θ/2) ln ln N + O(1),

    ln m = ln π(p_m) = ln p_m - ln ln p_m + O(1/log p_m)
         = √((ln N ln ln N)/2) - ((θ + 1)/2) ln ln N + O(log log log N),

    m^r/(r! N) = exp(-√(2 ln N ln ln N) + O(r log log log N)).

We will choose M so that M m^r/(r! N) ≥ 4m; thus the expected number of
outputs M P(m, N) will be at least 4m. The running time of the algorithm is
of order M m log N, plus O(m^3) steps for exercise 12; it turns out that O(m^3) is
less than M m log N, which is


    exp(√(8 (ln N)(ln ln N)) + O((log N)^{1/2} (log log N)^{-1/2} (log log log N))).

The probability that this method fails to find a factor is negligibly small, since
the probability is at most e^{-m/2} that fewer than 2m outputs are obtained (see
exercise 31), while the probability is at most 2^{-m} that no factors are found
from the first 2m outputs, and m ≥ ln N. We have proved the following slight
strengthening of Dixon’s original theorem:


Theorem D. There is an algorithm whose running time is O(N^{ε(N)}), where
ε(N) = c√(ln ln N/ln N) and c is any constant greater than √8, that finds a
nontrivial factor of N with probability 1 - O(1/N), whenever N has at least two
distinct prime divisors. ∎


Other approaches. Another factorization technique was suggested by John M. 
Pollard [Proc. Cambridge Phil. Soc. 76 (1974), 521-528], who gave a practical 
way to discover prime factors p of N when p — 1 has no large prime factors. 
The latter algorithm (see exercise 19) is probably the first thing to try after 
Algorithms A and B have run too long on a large N. 

A survey paper by R. K. Guy, written in collaboration with J. H. Conway, 
Congressus Numerantium 16 (1976), 49-89, gave a unique perspective on the de- 
velopments up till that time. Guy stated, “I shall be surprised if anyone regularly 
factors numbers of size 10^80 without special form during the present century”;
and he was indeed destined to be surprised many times during the next 20 years. 

Tremendous advances in factorization techniques for large numbers were
made during the 1980s, beginning with Carl Pomerance’s quadratic sieve method
of 1981 [see Lecture Notes in Comp. Sci. 209 (1985), 169-182]. Then Hendrik
Lenstra devised the elliptic curve method [Annals of Math. (2) 126 (1987),
649-673], which heuristically is expected to take about exp(√((2 + ε)(ln p)(ln ln p)))
multiplications to find a prime factor p. This is asymptotically the square root of
the running time in our estimate for Algorithm E when p ≈ √N, and it becomes
even better when N has relatively small prime factors. An excellent exposition
of this method has been given by Joseph H. Silverman and John Tate in Rational
Points on Elliptic Curves (New York: Springer, 1992), Chapter 4.

John Pollard came back in 1988 with another new technique, which has
become known as the number field sieve; see Lecture Notes in Math. 1554 (1993)
for a series of papers about this method, which is the current champion for
factoring extremely large integers. Its running time is predicted to be of order


    exp((64/9 + ε)^{1/3} (ln N)^{1/3} (ln ln N)^{2/3})   (25)


as N → ∞. The crossover point at which a well-tuned version of the number
field sieve begins to beat a well-tuned version of the quadratic sieve appears to
be at N ≈ 10^112, according to A. K. Lenstra.

Details of the new methods are beyond the scope of this book, but we can
get an idea of their effectiveness by noting some of the early success stories in
which unfactored Fermat numbers of the form 2^{2^k} + 1 were cracked. For example,
the factorization


    2^512 + 1 = 2424833 ·
        7455602825647884208337395736200454918783366342657 · p_99


was found by the number field sieve, after four months of computation that
occupied otherwise idle time on about 700 workstations [Lenstra, Lenstra, Manasse,
and Pollard, Math. Comp. 61 (1993), 319-349; 64 (1995), 1357]; here p_99 denotes
a 99-digit prime number. The next Fermat number has twice as many digits,
but it yielded to the elliptic curve method on October 20, 1995:


    2^1024 + 1 = 45592577 · 6487031809 ·
        4659775785220018543264560743076778192897 · p_252.


[Richard Brent, Math. Comp. 68 (1999), 429-451.] In fact, Brent had already
used the elliptic curve method to resolve the next case as early as 1988:


    2^2048 + 1 = 319489 · 974849 ·
        167988556341760475137 · 3560841906445833920513 · p_564;


by a stroke of good luck, all but one of the prime factors was < 10^22, so the
elliptic curve method was a winner.

What about 2^4096 + 1? At present, that number seems completely out of
reach. It has five factors < 10^16, but the unfactored residual has 1187 decimal
digits. The next case, 2^8192 + 1, has four known factors < 10^27 [Crandall and
Fagin, Math. Comp. 62 (1994), 321; Brent, Crandall, Dilcher, and van Halewyn,
Math. Comp. 69 (2000), 1297-1304] and a huge unfactored residual.


Secret factors. Worldwide interest in the problem of factorization increased 
dramatically in 1977, when R. L. Rivest, A. Shamir, and L. Adleman discovered 
a way to encode messages that can apparently be decoded only by knowing the 
factors of a large number N, even though the method of encoding is known to 
everyone. Since a significant number of the world’s greatest mathematicians 
have been unable to find efficient methods of factoring, this scheme [CACM 21 
(1978), 120-126] almost certainly provides a secure way to protect confidential 
data and communications in computer networks.

Let us imagine a small electronic device called an RSA box that has two large
prime numbers p and q stored in its memory. We will assume that p - 1 and q - 1
are not divisible by 3. The RSA box is connected somehow to a computer, and 
it has told the computer the product N = pq; however, no human being will be 
able to discover the values of p and q except by factoring N, since the RSA box 
is cleverly designed to self-destruct if anybody tries to tamper with it. In other 
words, it will erase its memory if it is jostled or if it is subjected to any radiation 
that could change or read out the data stored inside. Furthermore, the RSA 
box is sufficiently reliable that it never needs to be maintained; we simply would 
discard it and buy another, if an emergency arose or if it wore out. The prime 
factors p and q were generated by the RSA box itself, using some scheme based 
on truly random phenomena in nature like cosmic rays. The important point is 
that nobody knows p or q, not even a person or organization that owns or has 
access to this RSA box; there is no point in bribing or blackmailing anyone or 
holding anybody hostage in order to discover N’s factors. 

To send a secret message to the owner of an RSA box whose product number
is N, you break the message up into a sequence of numbers (x_1, ..., x_k), where
each x_i lies in the range 0 ≤ x_i < N; then you transmit the numbers


    (x_1^3 mod N, ..., x_k^3 mod N).


The RSA box, knowing p and q, can decode the message, because it has
precomputed a number d < N such that 3d ≡ 1 (modulo (p - 1)(q - 1)); it can
now compute each secret component (x_i^3 mod N)^d mod N = x_i in a reasonable
amount of time, using the method of Section 4.6.3. Naturally the RSA box keeps
this magic number d to itself; in fact, the RSA box might choose to remember
only d instead of p and q, because its only duties after having computed N are
to protect its secrets and to take cube roots mod N.
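
The encoding and decoding arithmetic can be tried out with toy numbers. In the Python sketch below, the primes 11 and 17 are purely illustrative choices (far too small for any real use) satisfying the condition that p - 1 and q - 1 are not divisible by 3; the function names are likewise assumptions. The modular inverse is computed with pow(3, -1, ...), which requires Python 3.8 or later.

```python
p, q = 11, 17                        # toy primes; p-1 = 10 and q-1 = 16 are not multiples of 3
N = p * q                            # 187, the only number the box makes public
d = pow(3, -1, (p - 1) * (q - 1))    # 3d ≡ 1 (modulo (p-1)(q-1)); here d = 107

def encode(x):
    """What any sender can compute: x^3 mod N."""
    return pow(x, 3, N)

def decode(c):
    """What only the box can compute: a cube root mod N."""
    return pow(c, d, N)

message = [42, 7, 186]
sent = [encode(x) for x in message]
print(d, [decode(c) for c in sent])   # 107 [42, 7, 186]
```

With these toy values every x in the range 0 ≤ x < N is recovered correctly, since 3d ≡ 1 holds modulo both p - 1 and q - 1.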

Such an encoding scheme is ineffective if x < ∛N, since x^3 mod N = x^3
and the cube root will easily be found. The logarithmic law of leading digits in
Section 4.2.4 implies that the leading place x_1 of a k-place message (x_1, ..., x_k)
will be less than ∛N about 1/3 of the time, so this is a problem that needs to be
resolved. Exercise 32 presents one way to avoid the difficulty.

The security of the RSA encoding scheme relies on the fact that nobody has 
been able to discover how to take cube roots quickly mod N without knowing 
N’s factors. It seems likely that no such method will be found, but we cannot be 
absolutely sure. So far all that can be said for certain is that all of the ordinary 
ways to discover cube roots will fail. For example, there is essentially no point 
in trying to compute the number d as a function of N; the reason is that if 
d is known, or in fact if any number m of reasonable size is known such that
x^m mod N = 1 holds for a significant number of x’s, then we can find the factors
of N in a few more steps (see exercise 34). Thus, any method of attack based 
explicitly or implicitly on finding such an m can be no better than factoring. 

Some precautions are necessary, however. If the same message is sent to
three different people on a computer network, a person who knows x^3 modulo N_1,
N_2, and N_3 could reconstruct x^3 mod N_1N_2N_3 = x^3 by the Chinese remainder
theorem, so x would no longer be a secret. In fact, even if a “time-stamped”
message (2^{⌈lg t_i⌉} x + t_i)^3 mod N_i is sent to seven different people, with known
or guessable t_i, the value of x can be deduced (see exercise 44). Therefore
some cryptographers have recommended encoding with the exponent 2^16 + 1 =
65537 instead of 3; this exponent is prime, and the computation of x^65537 mod N
takes only about 8.5 times as long as the computation of x^3 mod N. [CCITT
Recommendations Blue Book (Geneva: International Telecommunication Union,
1989), Fascicle VIII.8, Recommendation X.509, Annex C, pages 74-76.] The
original proposal of Rivest, Shamir, and Adleman was to encode x by x^a mod N
where a is any exponent prime to φ(N), not just a = 3; in practice, however, we
prefer an exponent for which encoding is faster than decoding.
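
The three-recipient attack is easy to demonstrate with toy numbers. In the Python sketch below, the three moduli 11 · 17, 5 · 23, and 29 · 41 and the message 100 are illustrative assumptions, as are the helper names; the eavesdropper uses only the public moduli and the three transmitted cubes, never the factors.

```python
def crt(residues, moduli):
    """Chinese remainder theorem for pairwise relatively prime moduli."""
    x, M = 0, 1
    for r, n in zip(residues, moduli):
        t = (r - x) * pow(M, -1, n) % n    # choose t so that x + M*t ≡ r (modulo n)
        x += M * t
        M *= n
    return x

def icbrt(n):
    """Largest integer r with r**3 <= n, by bisection on exact integers."""
    lo, hi = 0, 1
    while hi ** 3 <= n:
        hi *= 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid
    return lo

moduli = [11 * 17, 5 * 23, 29 * 41]             # N_1, N_2, N_3 (toy values)
secret = 100                                    # must be less than every N_i
intercepted = [pow(secret, 3, n) for n in moduli]

cube = crt(intercepted, moduli)                 # equals x^3, since x^3 < N_1 N_2 N_3
print(icbrt(cube))                              # 100
```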

The numbers p and q shouldn’t merely be “random” primes in order to 
make the RSA scheme effective. We have mentioned that p - 1 and q - 1 should
not be divisible by 3, since we want to ensure that unique cube roots exist
modulo N. Another condition is that p - 1 should have at least one very large
prime factor, and so should q - 1; otherwise N can be factored using the algorithm
of exercise 19. In fact, that algorithm essentially relies on finding a fairly small
number m with the property that x^m mod N is frequently equal to 1, and we
have just seen that such an m is dangerous. When p - 1 and q - 1 have large prime
factors p_1 and q_1, the theory in exercise 34 implies that m is either a multiple of
p_1 q_1 (hence m will be hard to discover) or the probability that x^m ≡ 1 will be less
than 1/p_1 q_1 (hence x^m mod N will almost never be 1). Besides this condition,
we don’t want p and q to be close to each other, lest Algorithm D succeed in
discovering them; in fact, we don’t want the ratio p/q to be near a simple fraction,
otherwise Lehman’s generalization of Algorithm C could find them.

The following procedure for generating p and q is almost surely unbreakable:
Start with a truly random number p_0 between, say, 10^80 and 10^81. Search
for the first prime number p_1 greater than p_0; this will require testing about
(1/2) ln p_0 ≈ 90 odd numbers, and it will be sufficient to have p_1 a “probable prime”
with probability > 1 - 2^{-100} after 50 trials of Algorithm P. Then choose another
truly random number p_2 between, say, 10^39 and 10^40. Search for the first prime
number p of the form kp_1 + 1 where k ≥ p_2, k is even, and k ≡ p_1 (modulo 3).
This will require testing about (1/3) ln p_1 p_2 ≈ 90 numbers before a prime p is found.
The prime p will be about 120 digits long; a similar construction can be used to
find a prime q about 130 digits long. For extra security, it is probably advisable
to check that neither p + 1 nor q + 1 consists entirely of rather small prime factors
(see exercise 20). The product N = pq, whose order of magnitude will be about
10^250, now meets all of our requirements, and it is inconceivable at this time that
such an N could be factored.
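
The construction just described might be sketched as follows. This is only an illustration under stated assumptions: is_probable_prime is a generic strong-pseudoprime (Miller-Rabin style) test standing in for Algorithm P, random.SystemRandom stands in for a source of “truly random” numbers, and the function and parameter names are hypothetical.

```python
import random

def is_probable_prime(n, trials=50):
    """Strong-pseudoprime test with random bases (a stand-in for Algorithm P)."""
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13):
        if n % small == 0:
            return n == small
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(trials):
        x = random.randrange(2, n - 1)
        y = pow(x, d, n)
        if y in (1, n - 1):
            continue
        for _ in range(s - 1):
            y = y * y % n
            if y == n - 1:
                break
        else:
            return False          # x is a witness: n is definitely composite
    return True

def next_prime_after(p0):
    """First probable prime greater than p0 (p0 is assumed to be large)."""
    candidate = p0 + 1
    if candidate % 2 == 0:
        candidate += 1
    while not is_probable_prime(candidate):
        candidate += 2
    return candidate

def strong_prime(p0_digits=80, p2_digits=39, rng=random.SystemRandom()):
    """Return (p, p1) with p = k*p1 + 1, k >= p2, k even, k = p1 (modulo 3)."""
    p0 = rng.randrange(10 ** p0_digits, 10 ** (p0_digits + 1))
    p1 = next_prime_after(p0)
    p2 = rng.randrange(10 ** p2_digits, 10 ** (p2_digits + 1))
    k = p2
    while k % 2 != 0 or k % 3 != p1 % 3:   # at most five steps
        k += 1
    while not is_probable_prime(k * p1 + 1):
        k += 6                             # keeps k even and k = p1 (modulo 3)
    return k * p1 + 1, p1
```

Since k ≡ p_1 (modulo 3) and p_1 is not a multiple of 3, we have p - 1 = kp_1 ≡ 1 (modulo 3), so p - 1 is not divisible by 3, while p_1 is a very large prime factor of p - 1. Calling strong_prime() with the default sizes should yield a p of about 120 digits; smaller sizes such as strong_prime(20, 10) are convenient for experiments.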

For example, suppose we knew a method that could factor a 250-digit
number N in N^{0.1} microseconds. This amounts to 10^25 microseconds, and there
are only 31,556,952,000,000 μs per year, so we would need more than 3 × 10^11
years of CPU time to complete the factorization. Even if a government agency
purchased 10 billion computers and set them all to working on this problem, it
would take more than 31 years before one of them would crack N into factors;
meanwhile the fact that the government had purchased so many specialized
machines would leak out, and people would start using 300-digit N’s.
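
The arithmetic behind this estimate is easy to check. The following few lines of Python simply redo the division; the variable names are chosen here for illustration, and N is taken to be about 10^250 so that N^{0.1} is 10^25.

```python
US_PER_YEAR = 31_556_952_000_000      # microseconds per year, as quoted in the text

microseconds = 10 ** 25               # N^0.1 when N is about 10^250
years_one_machine = microseconds / US_PER_YEAR
years_ten_billion = years_one_machine / 10 ** 10   # 10 billion machines sharing the work
```

The first quotient is a little more than 3 × 10^11 and the second a little more than 31, in agreement with the figures above.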

Since the encoding method x ↦ x^3 mod N is known to everyone, there
are additional advantages besides the fact that the code can be cracked only
by the RSA box. Such “public key” systems were first published by W. Diffie
and M. E. Hellman in IEEE Trans. IT-22 (1976), 644-654. As an example of
what can be done when the encoding method is public knowledge, suppose Alice
wants to communicate with Bob securely via electronic mail, signing her letter
so that Bob can be sure nobody else has forged it. Let E_A(M) be the encoding
function for messages M sent to Alice, let D_A(M) be the decoding done by
Alice’s RSA box, and let E_B(M), D_B(M) be the corresponding encoding and
decoding functions for Bob’s RSA box. Then Alice can send a signed message by
affixing her name and the date to some confidential message, then transmitting
E_B(D_A(M)) to Bob, using her machine to compute D_A(M). When Bob gets this
message, his RSA box converts it to D_A(M), and he knows E_A so he can compute
M = E_A(D_A(M)). This should convince him that the message did indeed come
from Alice; nobody else could have sent the message D_A(M). (Well, Bob himself
now knows D_A(M), so he could impersonate Alice by passing E_X(D_A(M)) to
Xavier. To defeat any such attempted forgery, the content of M should clearly
indicate that it is for Bob’s eyes only.)
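
The signing protocol can be traced with toy numbers. In the sketch below, make_rsa_box is a hypothetical helper, the primes are far too small for real use, and Alice’s modulus is deliberately chosen smaller than Bob’s so that D_A(M) is a legal input to E_B (a detail the description above leaves implicit).

```python
def make_rsa_box(p, q):
    """Return (N, encode, decode) for exponent 3; p and q are primes = 2 (modulo 3)."""
    n = p * q
    d = pow(3, -1, (p - 1) * (q - 1))     # the secret decoding exponent

    def encode(x):
        return pow(x, 3, n)               # public: x -> x^3 mod N

    def decode(y):
        return pow(y, d, n)               # private: cube root mod N

    return n, encode, decode

N_A, E_A, D_A = make_rsa_box(11, 17)      # Alice's box, N_A = 187
N_B, E_B, D_B = make_rsa_box(23, 29)      # Bob's box,   N_B = 667

M = 42                                    # the message, 0 <= M < N_A
signed = D_A(M)                           # only Alice can compute this
sent = E_B(signed)                        # what travels over the network
received = D_B(sent)                      # Bob's box strips the outer layer
recovered = E_A(received)                 # anyone can apply Alice's public E_A
```

Bob accepts the message because recovered equals M, which could have happened only if the sender was able to compute D_A.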

We might ask, how do Alice and Bob know each other’s encoding functions
E_A and E_B? It wouldn’t do simply to have them stored in a public file, since some
Charlie could tamper with that file, substituting an N that he has computed
by himself; Charlie could then surreptitiously intercept and decode a private
message before Alice or Bob would discover that something is amiss. The solution
is to keep the product numbers N_A and N_B in a special public directory that
has its own RSA box and its own widely publicized product number N_D. When
Alice wants to know how to communicate with Bob, she asks the directory for
Bob’s product number; the directory computer sends her a signed message giving
the value of N_B. Nobody can forge such a message, so it must be legitimate.

An interesting alternative to the RSA scheme has been proposed by Michael
Rabin [M.I.T. Lab. for Comp. Sci., report TR-212 (1979)], who suggests encoding
by the function x^2 mod N instead of x^3 mod N. In this case the decoding
mechanism, which we can call a SQRT box, returns four different messages; the
reason is that four different numbers have the same square modulo N, namely
x, -x, fx mod N, and (-fx) mod N, where


    f = (p^{q-1} - q^{p-1}) mod N.


If we agree in advance that x is even, or that x < N/2, then the ambiguity drops
to two messages, presumably only one of which makes any sense. The ambiguity
can in fact be eliminated entirely, as shown in exercise 35. Rabin’s scheme has
the important property that it is provably as difficult to find square roots mod N
as to find the factorization N = pq; for by taking the square root of x^2 mod N
when x is chosen at random, we have a 50-50 chance of finding a value y such
that x^2 ≡ y^2 and x ≢ ±y, after which gcd(x - y, N) = p or q. However, the
system has a fatal flaw that does not seem to be present in the RSA scheme (see
exercise 33): Anyone with access to a SQRT box can easily determine the factors
of its N. This not only permits cheating by dishonest employees, or threats of
extortion, it also allows people to reveal their p and q, after which they might
claim that their “signature” on some transmitted document was a forgery. Thus
it is clear that the goal of secure communication leads to subtle problems quite
different from those we usually face in the design and analysis of algorithms.
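
Both the fourfold ambiguity and the fatal flaw can be seen in a small experiment. The sketch below makes several assumptions for convenience: the secret primes are tiny and congruent to 3 modulo 4 (so that square roots modulo each prime are a single exponentiation), and the names sqrt_box and factor_with_box are hypothetical.

```python
from math import gcd
import random

P, Q = 11, 19            # the secret primes inside the box (toy values, both = 3 mod 4)
N = P * Q

def sqrt_box(a):
    """Return one square root of a modulo N, for a square a that is prime to N."""
    rp = pow(a, (P + 1) // 4, P)          # a square root modulo P
    rq = pow(a, (Q + 1) // 4, Q)          # a square root modulo Q
    # Combine the two roots by the Chinese remainder theorem.
    return (rp * Q * pow(Q, -1, P) + rq * P * pow(P, -1, Q)) % N

def factor_with_box(n, box, rng):
    """Find a proper factor of n using only black-box access to a SQRT box."""
    while True:
        x = rng.randrange(1, n)
        g = gcd(x, n)
        if g > 1:
            return g                      # x happened to share a factor with n
        y = box(x * x % n)
        if y not in (x, n - x):           # x^2 = y^2 but x is not +y or -y
            return gcd(x - y, n)

# The four square roots of x^2 are x, -x, fx, -fx with f as in the text.
f = (pow(P, Q - 1, N) - pow(Q, P - 1, N)) % N
x = 5
roots = sorted({x, N - x, f * x % N, (-f * x) % N})

factor = factor_with_box(N, sqrt_box, random.Random(1))
```

Each trial of factor_with_box succeeds with probability about 1/2, so a handful of queries to the box is expected to suffice.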

Historical note: It was revealed in 1997 that Clifford Cocks had considered
the encoding of messages by the transformation x^{pq} mod pq already in 1973, but
his work was kept secret.


The largest known primes. We have discussed several computational methods 
elsewhere in this book that require the use of large prime numbers, and the 
techniques just described can be used to discover primes of up to, say, 25 digits
with relative ease. Table 2 shows the ten largest primes that are less
than the word size of typical computers. (Some other useful primes appear in 
the answers to exercises 3.2.1.2-22 and 4.6.4-57.) 

Actually much larger primes of special forms are known, and it is occasionally 
important to find primes that are as large as possible. Let us therefore conclude 
this section by investigating the interesting manner in which the largest explicitly 
known primes have been discovered. Such primes are of the form 2^n - 1,
for various special values of n, and so they are especially suited to certain
applications of binary computers.

A number of the form 2^n - 1 cannot be prime unless n is prime, since 2^{uv} - 1
is divisible by 2^u - 1. In 1644, Marin Mersenne astonished his contemporaries by
stating, in essence, that the numbers 2^p - 1 are prime for p = 2, 3, 5, 7, 13, 17,
19, 31, 67, 127, 257, and for no other p less than 257. (This statement appeared
in connection with a discussion of perfect numbers in the preface to his Cogitata 
Physico-Mathematica. Curiously, he also made the following remark: “To tell if 
a given number of 15 or 20 digits is prime or not, all time would not suffice for 
the test, whatever use is made of what is already known.”) Mersenne, who had 
corresponded frequently with Fermat, Descartes, and others about similar topics 
in previous years, gave no proof of his assertions, and for over 200 years nobody 
knew whether he was correct. Euler showed that 2^31 - 1 is prime in 1772, after
having tried unsuccessfully to prove this in previous years. About 100 years
later, É. Lucas discovered that 2^127 - 1 is prime, but 2^67 - 1 was questionable;
therefore Mersenne might not be completely accurate. Then I. M. Pervushin
proved in 1883 that 2^61 - 1 is prime [see Istoriko-Mat. Issledovaniia 6 (1953),
559], and this touched off speculation that Mersenne had only made a copying
error, writing 67 for 61. Eventually other errors in Mersenne’s statement were
discovered; R. E. Powers [AMM 18 (1911), 195] showed that 2^89 - 1 is prime,
as had been conjectured by some earlier writers, and three years later he proved
that 2^107 - 1 also is prime. M. Kraitchik found in 1922 that 2^257 - 1 is not prime
[see his Recherches sur la Théorie des Nombres (Paris: 1924), 21]; computational
errors may have crept into his calculations, but his conclusion has turned out
to be correct.
Table 2
USEFUL PRIME NUMBERS

    N     a_1  a_2  a_3  a_4  a_5  a_6  a_7  a_8  a_9  a_10
  2^15     19   49   51   55   61   75   81  115  121  135
  2^16     15   17   39   57   87   89   99  113  117  123
  2^17      1    9   13   31   49   61   63   85   91   99
  2^18      5   11   17   23   33   35   41   65   75   93
  2^19      1   19   27   31   45   57   67   69   85   87
  2^20      3    5   17   27   59   69  129  143  153  185
  2^21      9   19   21   55   61   69  105  111  121  129
  2^22      3   17   27   33   57   87  105  113  117  123
  2^23     15   21   27   37   61   69  135  147  157  159
  2^24      3   17   33   63   75   77   89   95  117  167
  2^25     39   49   61   85   91  115  141  159  165  183
  2^26      5   27   45   87  101  107  111  117  125  135
  2^27     39   79  111  115  135  187  199  219  231  235
  2^28     57   89   95  119  125  143  165  183  213  273
  2^29      3   33   43   63   73   75   93   99  121  133
  2^30     35   41   83  101  105  107  135  153  161  173
  2^31      1   19   61   69   85   99  105  151  159  171
  2^32      5   17   65   99  107  135  153  185  209  267
  2^33      9   25   49   79  105  285  301  303  321  355
  2^34     41   77  113  131  143  165  185  207  227  281
  2^35     31   49   61   69   79  121  141  247  309  325
  2^36      5   17   23   65  117  137  159  173  189  233
  2^37     25   31   45   69  123  141  199  201  351  375
  2^38     45   87  107  131  153  185  191  227  231  257
  2^39      7   19   67   91  135  165  219  231  241  301
  2^40     87  167  195  203  213  285  293  299  389  437
  2^41     21   31   55   63   73   75   91  111  133  139
  2^42     11   17   33   53   65  143  161  165  215  227
  2^43     57   67  117  175  255  267  291  309  319  369
  2^44     17  117  119  129  143  149  287  327  359  377
  2^45     55   69   81   93  121  133  139  159  193  229
  2^46     21   57   63   77  167  197  237  287  305  311
  2^47    115  127  147  279  297  339  435  541  619  649
  2^48     59   65   89   93  147  165  189  233  243  257
  2^59     55   99  225  427  517  607  649  687  861  871
  2^60     93  107  173  179  257  279  369  395  399  453
  2^63     25  165  259  301  375  387  391  409  457  471
  2^64     59   83   95  179  189  257  279  323  353  363
  10^6     17   21   39   41   47   69   83   93  117  137
  10^7      9   27   29   57   63   69   71   93   99  111
  10^8     11   29   41   59   69  153  161  173  179  213
  10^9     63   71  107  117  203  239  243  249  261  267
  10^10    33   57   71  119  149  167  183  213  219  231
  10^11    23   53   57   93  129  149  167  171  179  231
  10^12    11   39   41   63  101  123  137  143  153  233
  10^16    63   83  113  149  183  191  329  357  359  369

The ten largest primes less than N are N - a_1, ..., N - a_10.

Numbers of the form 2^p - 1 are now called Mersenne numbers, and it is
known that Mersenne primes are obtained for p equal to


2, 3, 5, 7, 13, 17, 19, 31, 61, 89, 107, 127, 521, 607, 1279, 2203, 2281, 
3217, 4253, 4423, 9689, 9941, 11213, 19937, 21701, 23209, 44497, 86243, 
110503, 132049, 216091, 756839, 859433, 1257787, 1398269, 2976221, 
3021377, 6972593, 13466917, 20996011, 24036583, 25964951, 30402457, 
32582657, 37156667, 42643801, 43112609, 57885161, .... (26) 


The first entries above 100000 were found by David Slowinski and associates 
while testing new supercomputers [see J. Recreational Math. 11 (1979), 258- 
261]; he found 756839, 859433, and 1257787 in collaboration with Paul Gage 
during the 1990s. But the remaining exponents, beginning with 1398269, were 
found respectively by Joël Armengaud, Gordon Spence, Roland Clarkson, Nayan 
Hajratwala, Michael Cameron, Michael Shafer, Josh Findley, Martin Nowak, 
Curtis Cooper/Steven Boone, Hans-Michael Elvenich, Odd Magnar Strindmo, 
and Edson Smith using off-the-shelf personal computers, most recently in 2013. 
They used a program by George Woltman, who launched the Great Internet 
Mersenne Prime Search project (GIMPS) in 1996, with Internet administrative 
software contributed subsequently by Scott Kurowski. 

Notice that the prime 8191 = 2^13 - 1 does not occur in (26); Mersenne had
stated that 2^8191 - 1 is prime, and others had conjectured that any Mersenne
prime could perhaps be used in the exponent.

The search for large primes has not been systematic, because people have 
generally tried to set a hard-to-beat world record instead of spending time with 
smaller exponents. For example, 2^132049 - 1 was proved prime in 1983, and
2^216091 - 1 in 1984, but the case 2^110503 - 1 was not discovered until 1988.
Therefore one or more unknown Mersenne primes less than 2^57885161 - 1 might
still exist. According to Woltman, all exponents < 25,000,000 were checked as
of March 1, 2008; his volunteers are systematically filling the remaining gaps.

Since 2^57885161 - 1 has more than 17 million decimal digits, it is clear that
some special techniques have been used to prove that such numbers are prime. 
An efficient way to test the primality of a given Mersenne number 2^p - 1 was
first devised by E. Lucas [Amer. J. Math. 1 (1878), 184-239, 289-321, especially
page 316] and improved by D. H. Lehmer [Annals of Math. (2) 31 (1930), 419-448,
especially page 443]. The Lucas-Lehmer test, which is a special case of the
method now used for testing the primality of n when the factors of n + 1 are
known, is the following:


Theorem L. Let q be an odd prime, and define the sequence ⟨L_n⟩ by the rule

    L_0 = 4,    L_{n+1} = (L_n^2 - 2) mod (2^q - 1).          (27)

Then 2^q - 1 is prime if and only if L_{q-2} = 0.


For example, 2^3 - 1 is prime since L_1 = (4^2 - 2) mod 7 = 0. This test is
particularly well suited to binary computers, since calculation mod (2^q - 1) is so
convenient; see Section 4.3.2. Exercise 4.3.2-14 explains how to save time when
q is extremely large.
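
Theorem L translates almost directly into a program. The following Python sketch is a straightforward rendering of rule (27); the function name lucas_lehmer is chosen here for illustration, and no attempt is made to use the fast reduction mod 2^q - 1 or the multiplication techniques that a record-setting computation would need.

```python
def lucas_lehmer(q):
    """Return True if 2^q - 1 is prime, for an odd prime q (Theorem L)."""
    m = (1 << q) - 1          # the Mersenne number 2^q - 1
    L = 4                     # L_0
    for _ in range(q - 2):    # compute L_1, ..., L_{q-2}
        L = (L * L - 2) % m
    return L == 0

odd_primes = (3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 61, 89, 107, 127)
mersenne_exponents = [q for q in odd_primes if lucas_lehmer(q)]
```

The exponents 11, 23, and 29 are rejected, while the others in the list are accepted, in agreement with the beginning of (26).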
Proof. We will prove Theorem L using only very simple principles of number
theory, by investigating several features of recurring sequences that are of
independent interest. Consider the sequences ⟨U_n⟩ and ⟨V_n⟩ defined by

    U_0 = 0,   U_1 = 1,   U_{n+1} = 4U_n - U_{n-1};
    V_0 = 2,   V_1 = 4,   V_{n+1} = 4V_n - V_{n-1}.            (28)

The following equations are readily proved by induction:

    V_n = U_{n+1} - U_{n-1};                                   (29)
    U_n = ((2 + √3)^n - (2 - √3)^n)/√12;                       (30)
    V_n = (2 + √3)^n + (2 - √3)^n;                             (31)
    U_{m+n} = U_m U_{n+1} - U_{m-1} U_n.                       (32)
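
Identities (29)-(32) can be spot-checked numerically before they are proved. The sketch below is only a sanity check, not part of the proof; it represents (2 + √3)^n exactly as a + b√3 with integers a and b, so that (30) and (31) amount to U_n = b and V_n = 2a. The helper names are hypothetical.

```python
def uv_sequences(limit):
    """Return lists U[0..limit], V[0..limit] defined by the recurrences (28)."""
    U, V = [0, 1], [2, 4]
    for n in range(1, limit):
        U.append(4 * U[n] - U[n - 1])
        V.append(4 * V[n] - V[n - 1])
    return U, V

def power_of_2_plus_sqrt3(n):
    """Return integers (a, b) with (2 + sqrt 3)^n = a + b*sqrt(3)."""
    a, b = 1, 0
    for _ in range(n):
        a, b = 2 * a + 3 * b, a + 2 * b   # multiply a + b*sqrt(3) by 2 + sqrt(3)
    return a, b

U, V = uv_sequences(40)

check_29 = all(V[n] == U[n + 1] - U[n - 1] for n in range(1, 40))
check_30_31 = all(power_of_2_plus_sqrt3(n) == (V[n] // 2, U[n]) for n in range(41))
check_32 = all(U[m + n] == U[m] * U[n + 1] - U[m - 1] * U[n]
               for m in range(1, 20) for n in range(20))
```

All three checks come out true for the ranges tested, which of course proves nothing but guards against misreading the formulas.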

Let us now prove an auxiliary result, when p is prime and e ≥ 1:

    if U_n ≡ 0 (modulo p^e) then U_{np} ≡ 0 (modulo p^{e+1}).     (33)

This follows from the more general considerations of exercise 3.2.2-11, but a
direct proof can be given for sequence (28). Assume that U_n = bp^e, U_{n+1} = a.
By (32) and (28), U_{2n} = bp^e(2a - 4bp^e) ≡ 2aU_n (modulo p^{e+1}), while we have
U_{2n+1} = U_{n+1}^2 - U_n^2 ≡ a^2. Similarly, U_{3n} = U_{2n+1}U_n - U_{2n}U_{n-1} ≡ 3a^2 U_n and
U_{3n+1} = U_{2n+1}U_{n+1} - U_{2n}U_n ≡ a^3. In general,

    U_{kn} ≡ k a^{k-1} U_n   and   U_{kn+1} ≡ a^k   (modulo p^{e+1}),

so (33) follows if we take k = p.
From formulas (30) and (31) we can obtain other expressions for U_n and V_n,
expanding (2 ± √3)^n by the binomial theorem:

    U_n = \sum_k \binom{n}{2k+1} 2^{n-2k-1} 3^k,    V_n = \sum_k \binom{n}{2k} 2^{n-2k+1} 3^k.     (34)

Now if we set n = p, where p is an odd prime, and if we use the fact that \binom{p}{k} is
a multiple of p except when k = 0 or k = p, we find that

    U_p ≡ 3^{(p-1)/2},    V_p ≡ 4    (modulo p).                 (35)

If p ≠ 3, Fermat’s theorem tells us that 3^{p-1} ≡ 1; hence (3^{(p-1)/2} - 1) ×
(3^{(p-1)/2} + 1) ≡ 0, and 3^{(p-1)/2} ≡ ±1. When U_p ≡ -1, we have U_{p+1} =
4U_p - U_{p-1} = 4U_p + V_p - U_{p+1} ≡ -U_{p+1}; hence U_{p+1} mod p = 0. When U_p ≡ +1,
we have U_{p-1} = 4U_p - U_{p+1} = 4U_p - V_p - U_{p-1} ≡ -U_{p-1}; hence U_{p-1} mod p = 0.
We have proved that, for all primes p, there is an integer ε(p) such that

    U_{p+ε(p)} mod p = 0,    |ε(p)| ≤ 1.                         (36)


Now if N is any positive integer, and if m = m(N) is the smallest positive
integer such that U_{m(N)} mod N = 0, we have

    U_n mod N = 0 if and only if n is a multiple of m(N).        (37)

(This number m(N) is called the rank of apparition of N in the sequence.)
To prove (37), observe that the sequence U_m, U_{m+1}, U_{m+2}, ... is congruent
(modulo N) to aU_0, aU_1, aU_2, ..., where a = U_{m+1} mod N is relatively prime
to N because gcd(U_n, U_{n+1}) = 1.

With these preliminaries out of the way, we are ready to prove Theorem L.
By (27) and induction,

    L_n = V_{2^n} mod (2^q - 1).                                 (38)

Furthermore, the identity 2U_{n+1} = 4U_n + V_n implies that gcd(U_n, V_n) ≤ 2, since
any common factor of U_n and V_n must divide U_n and 2U_{n+1}, while U_n ⊥ U_{n+1}.
So U_n and V_n have no odd factor in common, and if L_{q-2} = 0 we must have

    U_{2^{q-1}} = U_{2^{q-2}} V_{2^{q-2}} ≡ 0 (modulo 2^q - 1),
    U_{2^{q-2}} ≢ 0 (modulo 2^q - 1).


Now if m = m(2^q - 1) is the rank of apparition of 2^q - 1, it must be a divisor
of 2^{q-1} but not of 2^{q-2}; thus m = 2^{q-1}. We will prove that n = 2^q - 1 must
therefore be prime: Let the factorization of n be p_1^{e_1} ... p_r^{e_r}. All primes p_j are
greater than 3, since n is odd and congruent to (-1)^q - 1 = -2 (modulo 3).
From (33), (36), and (37) we know that U_t ≡ 0 (modulo 2^q - 1), where

    t = lcm(p_1^{e_1 - 1}(p_1 + ε_1), ..., p_r^{e_r - 1}(p_r + ε_r)),

and each ε_j is ±1. It follows that t is a multiple of m = 2^{q-1}. Let n_0 =
\prod_{j=1}^{r} p_j^{e_j - 1}(p_j + ε_j); we have n_0 ≤ \prod_{j=1}^{r} p_j^{e_j - 1}(p_j + (1/5)p_j) = (6/5)^r n. Also, because
p_j + ε_j is even, t ≤ n_0/2^{r-1}, since a factor of two is lost each time the least
common multiple of two even numbers is taken. Combining these results, we
have m ≤ t ≤ 2(3/5)^r n < 4(3/5)^r m < 3m; hence r ≤ 2 and t = m or t = 2m,
a power of 2. Therefore e_1 = 1, e_r = 1, and if n is not prime we must have
n = 2^q - 1 = (2^k + 1)(2^l - 1) where 2^k + 1 and 2^l - 1 are prime. The latter
factorization is obviously impossible when q is odd, so n is prime.

Conversely, suppose that n = 2^q - 1 is prime; we must show that V_{2^{q-2}} ≡ 0
(modulo n). For this purpose it suffices to prove that V_{2^{q-1}} ≡ -2 (modulo n),
since V_{2^{q-1}} = (V_{2^{q-2}})^2 - 2. Now

    V_{2^{q-1}} = ((√2 + √6)/2)^{n+1} + ((√2 - √6)/2)^{n+1}

                = 2^{-n} \sum_k \binom{n+1}{2k} (√2)^{n+1-2k} (√6)^{2k} = 2^{(1-n)/2} \sum_k \binom{n+1}{2k} 3^k.

Since n is an odd prime, the binomial coefficient

    \binom{n+1}{2k} = \binom{n}{2k} + \binom{n}{2k-1}

is divisible by n except when 2k = 0 and 2k = n + 1; hence

    2^{(n-1)/2} V_{2^{q-1}} ≡ 1 + 3^{(n+1)/2} (modulo n).

Here 2 ≡ (2^{(q+1)/2})^2, so 2^{(n-1)/2} ≡ (2^{(q+1)/2})^{n-1} ≡ 1 by Fermat’s theorem.
Finally, by a simple case of the law of quadratic reciprocity (see exercise 23),
3^{(n-1)/2} ≡ -1, since n mod 3 = 1 and n mod 4 = 3. This means V_{2^{q-1}} ≡ -2, so
we must have V_{2^{q-2}} ≡ 0 as desired. ∎


412 ARITHMETIC 4.5.4 


An anonymous author whose works are now preserved in Italian libraries 
had discovered by 1460 that 2^17 - 1 and 2^19 - 1 are prime. Ever since then,
the world’s largest explicitly known prime numbers have almost always been 
Mersenne primes. But the situation might change, since Mersenne primes are 
getting harder to find, and since exercise 27 presents an efficient test for primes 
of other forms. [See E. Picutti, Historia Math. 16 (1989), 123-136; Hugh C. 
Williams, Edouard Lucas and Primality Testing (1998), Chapter 2.] 


EXERCISES 


1. [10] If the sequence d_0, d_1, d_2, ... of trial divisors in Algorithm A contains a
number that is not prime, why will it never appear in the output? 


2. [15] If it is known that the input N to Algorithm A is equal to 3 or more, could 
step A2 be eliminated? 


3. [M20] Show that there is a number P with the following property: If 1000 < n < 
1000000, then n is prime if and only if gcd(n, P) = 1. 


4. [M29] In the notation of exercise 3.1-7 and Section 1.2.11.3, prove that the
average value of the least n such that X_n = X_{ℓ(n)-1} lies between 1.5Q(m) - 0.5
and 1.625Q(m) - 0.5.


5. [21] Use Fermat’s method (Algorithm D) to find the factors of 11111 by hand, 
when the moduli are 3, 5, 7, 8, and 11. 


6. [M24] If p is an odd prime and if N is not a multiple of p, prove that the number
of integers x such that 0 ≤ x < p and x^2 - N ≡ y^2 (modulo p) has a solution y is equal
to (p + 1)/2.

7. [25] Discuss the problems of programming the sieve of Algorithm D on a binary 
computer when the table entries for modulus m_i do not exactly fill an integral number
of memory words. 


8. [23] (The sieve of Eratosthenes, 3rd century B.C.) The following procedure
evidently discovers all odd prime numbers less than a given integer N, since it removes
all the nonprime numbers: Start with all the odd numbers between 1 and N; then
successively strike out the multiples p_k^2, p_k(p_k + 2), p_k(p_k + 4), ..., of the kth prime
p_k, for k = 2, 3, 4, ..., until reaching a prime p_k with p_k^2 > N.

Show how to adapt the procedure just described into an algorithm that is directly 
suited to efficient computer calculation, using no multiplication. 


9. [M25] Let n be an odd number, n ≥ 3. Show that if the number λ(n) of Theorem
3.2.1.2B is a divisor of n - 1 but not equal to n - 1, then n must have the form p_1 p_2 ... p_t
where the p’s are distinct primes and t ≥ 3.

10. [M26] (John Selfridge.) Prove that if, for each prime divisor p of n - 1, there is
a number x_p such that x_p^{(n-1)/p} mod n ≠ 1 but x_p^{n-1} mod n = 1, then n is prime.

11. [M20] What outputs does Algorithm E give when N = 197209, k = 5, m = 1? 
[Hint: √(5·197209) = 992 + //1, 495, 2, 495, 1, 1984//.]

12. [M28] Design an algorithm that uses the outputs of Algorithm E to find a proper 
factor of N, if Algorithm E has produced enough outputs to deduce a solution of (18). 


13. [HM25] (J. D. Dixon.) Prove that whenever the algorithm of exercise 12 is
presented with a solution (x, e_0, ..., e_m) whose exponents are linearly dependent modulo 2
on the exponents of previous solutions, the probability is 2^{1-d} that a factorization will
not be found, when N has d distinct prime factors and x is chosen at random.


14. [M20] Prove that the number T in step E3 of Algorithm E will never be a multiple 
of an odd prime p for which (kN)^{(p-1)/2} mod p > 1.


> 15. [M34] (Lucas and Lehmer.) Let P and Q be relatively prime integers, and let 
U_0 = 0, U_1 = 1, U_{n+1} = PU_n - QU_{n-1} for n ≥ 1. Prove that if N is a positive integer
relatively prime to 2P^2 - 8Q, and if U_{N+1} mod N = 0, while U_{(N+1)/p} mod N ≠ 0 for
each prime p dividing N + 1, then N is prime. (This gives a test for primality when
the factors of N + 1 are known instead of the factors of N - 1. We can evaluate U_m in
O(log m) steps as in exercise 4.6.3-26.) [Hint: See the proof of Theorem L.]


16. [M50] Are there infinitely many Mersenne primes? 


17. [M25] (V. R. Pratt.) A complete proof of primality by the converse of Fermat’s 
theorem takes the form of a tree whose nodes have the form (q, x), where q and x
are positive integers satisfying the following arithmetic conditions: (i) If (q_1, x_1), ...,
(q_t, x_t) are the children of (q, x) then q = q_1 ... q_t + 1. [In particular, if (q, x) is childless,
then q = 2.] (ii) If (r, y) is a child of (q, x), then x^{(q-1)/r} mod q ≠ 1. (iii) For each node
(q, x), we have x^{q-1} mod q = 1. From these conditions it follows that q is prime and
x is a primitive root modulo q, for all nodes (q, x). [For example, the tree


(1009, 11) 


wee | 


(2,1) (2,1) (2,1) (2,1) (7,3) (3, 2) (3, 2) 


demonstrates that 1009 is prime.] Prove that such a tree with root (q, x) has at most 
f(q) nodes, where f is a rather slowly growing function. 

> 18. [HM23] Give a heuristic proof of (7), analogous to the text’s derivation of (6). 
What is the approximate probability that p_{t-1} ≤ √p_t?

> 19. [M25] (J. M. Pollard.) Show how to compute a number M that is divisible by 
all odd primes p such that p - 1 is a divisor of some given number D. [Hint: Consider
numbers of the form a^n - 1.] Such an M is useful in factorization, for by computing
gcd(M, N) we may discover a factor of N. Extend this idea to an efficient method that
has high probability of discovering prime factors p of a given large number N, when all
prime power factors of p - 1 are less than 10^3 except for at most one prime factor less
than 10^5. [For example, the second-largest prime dividing (15) would be detected by
this method, since it is 1 + 2^4 · 5^2 · 67 · 107 · 199 · 41231.]
20. [M40] Consider exercise 19 with p+ 1 replacing p — 1. 


21. [M49] (R. K. Guy.) Let m(p) be the number of iterations required by Algorithm B
to cast out the prime factor p. Is m(p) = O(√(p log p)) as p → ∞?

> 22. [M30] (M. O. Rabin.) Let pn be the probability that Algorithm P guesses wrong, 
when n is an odd integer ≥ 3. Show that p_n < 1/4 for all n.
23. [M35] The Jacobi symbol (p/q) is defined to be -1, 0, or +1 for all integers p ≥ 0
and all odd integers q > 1 by the rules (p/q) ≡ p^{(q-1)/2} (modulo q) when q is prime;
(p/q) = (p/q_1) ... (p/q_t) when q is the product q_1 ... q_t of t primes (not necessarily distinct).
Thus it generalizes the Legendre symbol of exercise 1.2.4-47.

a) Prove that (p/q) satisfies the following relationships, hence it can be computed
efficiently: (0/q) = 0; (1/q) = 1; (p/q) = ((p mod q)/q); (2/q) = (-1)^{(q^2-1)/8};
(pp′/q) = (p/q)(p′/q); (p/q) = (-1)^{(p-1)(q-1)/4} (q/p) if both p and q are odd.
[The latter law, which is a reciprocity relation reducing the evaluation of (p/q)
to the evaluation of (q/p), has been proved in exercise 1.2.4-47(d) when both
p and q are prime, so you may assume its validity in that special case.]

b) (Solovay and Strassen.) Prove that if n is odd but not prime, the number of
integers x such that 1 ≤ x < n and 0 ≠ (x/n) ≡ x^{(n-1)/2} (modulo n) is at most
(1/2)φ(n). (Thus, the following testing procedure correctly determines whether or
not a given n is prime, with probability at least 1/2 for all fixed n: “Generate x at
random with 1 ≤ x < n. If 0 ≠ (x/n) ≡ x^{(n-1)/2} (modulo n), say that n is probably
prime, otherwise say that n is definitely not prime.”)

c) (L. Monier.) Prove that if n and x are numbers for which Algorithm P concludes
that “n is probably prime”, then 0 ≠ (x/n) ≡ x^{(n-1)/2} (modulo n). [Hence
Algorithm P is always superior to the test in (b).]

> 24. [M25] (L. Adleman.) When n > 1 and x > 1 are integers, n odd, let us say that 


n “passes the x test of Algorithm P” if either x mod n = 0 or if steps P2-P5 lead to
the conclusion that n is probably prime. Prove that, for any N, there exists a set of
positive integers x_1, ..., x_m ≤ N with m ≤ ⌊lg N⌋ such that a positive odd integer
in the range 1 < n ≤ N is prime if and only if it passes the x test of Algorithm P for
x = x_1 mod n, ..., x = x_m mod n. Thus, the probabilistic test for primality can in
principle be converted into an efficient test that leaves nothing to chance. (You need
not show how to compute the x_j efficiently; just prove that they exist.)


25. [HM41] (B. Riemann.) Prove that 


r(x?) n(x!) z dt elttiT) Ine dt 
| a ij; 
m) 2 3 i Int 2 f ¢ age TOW 


where the sum is over all complex σ + iτ such that τ > 0 and ζ(σ + iτ) = 0.


> 26. [M25] (H. C. Pocklington, 1914.) Let N = fr+1, where 0 <r < f+1. Prove 
that N is prime if, for every prime divisor p of f, there is an integer x_p such that

    x_p^{N-1} mod N = gcd(x_p^{(N-1)/p} - 1, N) = 1.


> 27. [M30] Show that there is a way to test numbers of the form N = 5-2” + 1 for 
primality, using approximately the same number of squarings mod N as the Lucas- 
Lehmer test for Mersenne primes in Theorem L. [Hint: See the previous exercise.] 


28. [M27] Given a prime p and a positive integer d, what is the value of f(p, d), the
average number of times that p divides A^2 - dB^2 (counting multiplicity), when A and B
are random integers that are independent except for the condition A ⊥ B?


29. [M25] Prove that the number of positive integers ≤ n whose prime factors are all
contained in a given set of primes {p_1, ..., p_m} is at least m^r/r!, when r = ⌊log n/log p_m⌋
and p_1 < ··· < p_m.


30. [HM35] (J. D. Dixon and Claus-Peter Schnorr.) Let p_1 < ··· < p_m be primes
that do not divide the odd number N, and let r be an even integer ≤ log N/log p_m.
Prove that the number of integers X in the range 0 ≤ X < N such that X^2 mod N =
p_1^{e_1} ... p_m^{e_m} is at least m^r/r!. Hint: Let the prime factorization of N be q_1^{f_1} ... q_d^{f_d}.
Show that a sequence of exponents (e_1, ..., e_m) leads to 2^d solutions X whenever we
have e_1 + ··· + e_m ≤ r and p_1^{e_1} ... p_m^{e_m} is a quadratic residue modulo q_i for 1 ≤ i ≤ d.
Such exponent sequences can be obtained as ordered pairs (e′_1, ..., e′_m; e″_1, ..., e″_m) where

    e′_1 + ··· + e′_m ≤ r/2   and   e″_1 + ··· + e″_m ≤ r/2   and

    (p_1^{e′_1} ... p_m^{e′_m})^{(q_i-1)/2} ≡ (p_1^{e″_1} ... p_m^{e″_m})^{(q_i-1)/2} (modulo q_i)   for 1 ≤ i ≤ d.


31. [M20] Use exercise 1.2.10-21 to estimate the probability that Dixon’s factoriza- 
tion algorithm (as described preceding Theorem D) obtains fewer than 2m outputs. 


32. [M21] Show how to modify the RSA encoding scheme so that there is no problem
with messages < ∛N, in such a way that the length of messages is not substantially
increased.


33. [M50] Prove or disprove: If a reasonably efficient algorithm exists that has a 
nonnegligible probability of being able to find x mod N, given a number N = pq whose
prime factors satisfy p ≡ q ≡ 2 (modulo 3) and given the value of x^3 mod N, then there
is a reasonably efficient algorithm that has a nonnegligible probability of being able to 
find the factors of N. [If this could be proved, it would not only show that the cube 
root problem is as difficult as factoring, it would also show that the RSA scheme has 
the same fatal flaw as the SQRT scheme.] 


34. [M30] (Peter Weinberger.) Suppose N = pq in the RSA scheme, and suppose you 
know a number m such that x^m mod N = 1 for at least 10^{-12} of all positive integers x.
Explain how you could go about factoring N without great difficulty, if m is not too
large (say m < N^{10}).

35. [M25] (H. C. Williams, 1979.) Let N be the product of two primes p and q, 


where p mod 8 = 3 and q mod 8 = 7. Prove that the Jacobi symbol satisfies (-x/N) =
(x/N) = -(2x/N), and use this property to design an encoding/decoding scheme analogous
to Rabin’s SQRT box but with no ambiguity of messages.

36. [HM24] The asymptotic analysis following (22) is too coarse to give meaningful 
values unless N is extremely large, since ln ln N is always rather small when N is in a
practical range. Carry out a more precise analysis that gives insight into the behavior
of (22) for reasonable values of N; also explain how to choose a value of ln m that
minimizes (22) except for a factor of size at most exp(O(log log N)).


37. [M27] Prove that the square root of every positive integer D has a periodic 
continued fraction of the form 


    √D = R + //a_1, ..., a_n, 2R, a_1, ..., a_n, 2R, a_1, ..., a_n, 2R, ... //,


unless D is a perfect square, where R = ⌊√D⌋ and (a_1, ..., a_n) is a palindrome (that
is, a_i = a_{n+1-i} for 1 ≤ i ≤ n).

38. [25] (Useless primes.) For 0 ≤ d ≤ 9, find P_d, the largest 50-digit prime number
that has the maximum possible number of decimal digits equal to d. (First maximize 
the number of d’s, then find the largest such prime.) 


39. [40] Many primes p have the property that 2p + 1 is also prime; for example,
5 → 11 → 23 → 47. More generally, say that q is a successor of p if p and q are both
prime and q = 2^k p + 1 for some k ≥ 0. For example, 2 → 3 → 7 → 29 → 59 → 1889 →
3779 → 7559 → 4058207223809 → 32465657790473 → 4462046030502692971872257 →
95(30 omitted digits)37 → ···; the smallest successor of 95...37 has 103 digits.

Find the longest chain of successive primes that you can. 
> 40. [M36] (A. Shamir.) Consider an abstract computer that can perform the
operations x + y, x - y, x·y, and ⌊x/y⌋ on integers x and y of arbitrary length, in just one
unit of time, no matter how large those integers are. The machine stores integers in a 
random-access memory and it can select different program steps depending on whether 
or not x = y, given x and y. The purpose of this exercise is to demonstrate that there 
is an amazingly fast way to factorize numbers on such a computer. (Therefore it will 
probably be quite difficult to show that factorization is inherently complicated on real 
machines, although we suspect that it is.) 


a) Find a way to compute n! in O(log n) steps on such a computer, given an integer
value n ≥ 2. [Hint: If A is a sufficiently large integer, the binomial coefficients
(m choose k) = m!/((m − k)! k!) can be computed readily from the value of (A + 1)^m.]

b) Show how to compute a number f(n) in O(log n) steps on such a computer, given
an integer value n ≥ 2, having the following properties: f(n) = n if n is prime,
otherwise f(n) is a proper (but not necessarily prime) divisor of n. [Hint: If n ≠ 4,
one such function f(n) is gcd(m(n), n), where m(n) = min{m | m! mod n = 0}.]

(As a consequence of (b), we can completely factor a given number n by doing only
O(log n)^2 arithmetic operations on arbitrarily large integers: Given a partial
factorization n = n_1 … n_r, each nonprime n_i can be replaced by f(n_i) · (n_i/f(n_i)) in
∑ O(log n_i) = O(log n) steps, and this refinement can be repeated until all n_i are prime.)


> 41. [M28] (Lagarias, Miller, and Odlyzko.) The purpose of this exercise is to show 
that the number of primes less than N^3 can be calculated by looking only at the primes
less than N^2, and thus to evaluate π(N^3) in O(N^{2+ε}) steps.

Say that an “m-survivor” is a positive integer whose prime factors all exceed m;
thus, an m-survivor remains in the sieve of Eratosthenes (exercise 8) after all multiples
of primes ≤ m have been sieved out. Let f(x, m) be the number of m-survivors that
are ≤ x, and let f_k(x, m) be the number of such survivors that have exactly k prime
factors (counting multiplicity).

a) Prove that π(N^3) = π(N) + f(N^3, N) − 1 − f_2(N^3, N).

b) Explain how to compute f_2(N^3, N) from the values of π(x) for x ≤ N^2. Use your
method to evaluate f_2(1000, 10) by hand.

c) Same question as (b), but evaluate f(N^3, N) instead of f_2(N^3, N). [Hint: Use
the identity f(x, p_j) = f(x, p_{j−1}) − f(x/p_j, p_{j−1}), where p_j is the jth prime and
p_0 = 1.]

d) Discuss data structures for the efficient evaluation of the quantities in (b) and (c). 

42. [M35] (H. W. Lenstra, Jr.) Given 0 < r < s < N with r ⊥ s and N ⊥ s, show
that it is possible to find all divisors of N that are ≡ r (modulo s) by performing
O(⌈N/s^3⌉^{1/2} log s) well-chosen arithmetic operations on (lg N)-bit numbers. [Hint:
Apply exercise 4.5.3–49.]


> 43. [M43] Let m = pq be an r-bit Blum integer as in Theorem 3.5P, and let Qm = 
{y | y = x^2 mod m for some x}. Then Q_m has (p + 1)(q + 1)/4 elements, and every
element y ∈ Q_m has a unique square root x = √y such that x ∈ Q_m. Suppose G(y)
is an algorithm that correctly guesses √y mod 2 with probability ≥ 1/2 + ε, when y is a
random element of Q_m. The goal of this exercise is to prove that the problem solved
by G is almost as hard as the problem of factoring m.

a) Construct an algorithm A(G, m, ε, y, δ) that uses random numbers and algorithm G
to guess whether a given integer y is in Q_m, without necessarily computing √y.
Your algorithm should guess correctly with probability ≥ 1 − δ, and its running
time T(A) should be at most O(ε^{−2}(log δ^{−1})T(G)), assuming that T(G) ≥ r^2. (If
T(G) < r^2, replace T(G) by (T(G) + r^2) in this formula.)

b) Construct an algorithm F(G, m, ε) that finds the factors of m with expected
running time T(F) = O(r^2(ε^{−6} + ε^{−4}(log ε^{−1})T(G))).

Hints: For fixed y ∈ Q_m, and for 0 ≤ v < m, let τv = v√y mod m and λv =
τv mod 2. Notice that λ(−v) + λv = 1 and λ(v_1 + ⋯ + v_n) = (λv_1 + ⋯ + λv_n +
⌊(τv_1 + ⋯ + τv_n)/m⌋) mod 2. Furthermore we have τ(½v) = ½(τv + mλv); here ½v
stands for (((m + 1)/2)v) mod m. If ±v ∈ Q_m we have τ(±v) = √(v^2 y); therefore algorithm G
gives us a way to guess λv for about half of all v.


44. [M35] (J. Håstad.) Show that it is not difficult to find x when a_{i0} + a_{i1}x + a_{i2}x^2 +
a_{i3}x^3 ≡ 0 (modulo m_i), 0 < x < m_i, gcd(a_{i0}, a_{i1}, a_{i2}, a_{i3}, m_i) = 1, and m_i > 10^{27} for
1 ≤ i ≤ 7, if m_i ⊥ m_j for 1 ≤ i < j ≤ 7. (All variables are integers; all but x are
known.) Hint: When L is any nonsingular matrix of real numbers, the algorithm of
Lenstra, Lenstra, and Lovász [Mathematische Annalen 261 (1982), 515–534] efficiently
finds a nonzero integer vector v = (v_1, …, v_n) such that length(vL) ≤ √(n 2^n) |det L|^{1/n}.


45. [M41] (J. M. Pollard and Claus-Peter Schnorr.) Find an efficient way to solve the
congruence

    x^2 − a y^2 ≡ b (modulo n)

for integers x and y, given integers a, b, and n with ab ⊥ n and n odd, even if the
factorization of n is unknown. [Hint: Use the identity (x_1^2 − a y_1^2)(x_2^2 − a y_2^2) = x^2 − a y^2,
where x = x_1 x_2 − a y_1 y_2 and y = x_1 y_2 − x_2 y_1.]

46. [HM30] (L. Adleman.) Let p be a rather large prime number and let a be a
primitive root modulo p; thus, all integers b in the range 1 ≤ b < p can be written
b = a^n mod p, for some unique n with 1 ≤ n < p.

Design an algorithm that almost always finds n, given b, in O(p^ε) steps for all
ε > 0, using ideas similar to those of Dixon’s factoring algorithm. [Hint: Start by
building a repertoire of numbers n_i such that a^{n_i} mod p has only small prime factors.]


47. [M50] A certain literary quotation x = x_1 x_2, represented in ASCII code, has the
enciphered value (x_1^3 mod N, x_2^3 mod N) =


(837 2e6cadf564a9ee347092daef c242058b8044228597 ebf 2326bbbf f 1583ea4200d895d9564d39229c79af8 
72a72e38bb92852a22679080e269c30690fab0eci9f78e9ef 8bae7 4b600F 4ebef 42a1dd5a6d806dc70b96de2 
bf4a6c7d2ebb51bfd156dd8ac3cb0aeicic38d76a3427bec3f 12af7d4d04314c0d8377a0c79db1ibif0cd1702 
2aabcdOf9f 1£9fb382313246de168bae6a28d177963a8ebe6023Ff 1chbd8632caee9604f63cba6e33ceblelbd 
4732a2973£5021e96e05e0da932b5b1d2bc618351ca584bb6e49255ba22dcad55ebd6b93a9c94d8749bb53be2 
90650878b17£4f e30bbb08453929a94a2efe3367e2cd92ea31a5e0d9F466870b16227 2e9e164e8c3238da519) 


in hexadecimal notation, where N is 


c97dicbcc3b67dibai97100df7dbd2d2864c4f ef 4a78e62ddd1423d97 2bc7a420f66046386462d260d68a8b2 
3£bf12354705d874£79c22698f 750c1b4435bc99174e58180bd18560a5c69c4eafb57 34467 9f 588f624ec18 
4c3e7098e65ac7b88Ff89el1 fadcdc3558c878dde6bc7 c32be57 c5e7 e8d95d697ad3c6c343485132dcbb74f411 


What is x? 


The problem of distinguishing prime numbers from composites, 
and of resolving composite numbers into their prime factors, 
is one of the most important and useful in all of arithmetic. 

. . . The dignity of science seems to demand that every aid to the solution
of such an elegant and celebrated problem be zealously cultivated.


— C. F. GAUSS, Disquisitiones Arithmeticæ, Article 329 (1801)


4.6. POLYNOMIAL ARITHMETIC


THE TECHNIQUES we have been studying apply in a natural way to many types 
of mathematical quantities, not simply to numbers. In this section we shall deal 
with polynomials, which are the next step up from numbers. Formally speaking, 
a polynomial over S is an expression of the form 


    u(x) = u_n x^n + ⋯ + u_1 x + u_0,        (1)


where the coefficients u_n, …, u_1, u_0 are elements of some algebraic system S,
and the variable x may be regarded as a formal symbol with an indeterminate 
meaning. We will assume that the algebraic system S is a commutative ring with 
identity; this means that S admits the operations of addition, subtraction, and 
multiplication, satisfying the customary properties: Addition and multiplication 
are binary operations defined on S; they are associative and commutative, and 
multiplication distributes over addition. There is an additive identity element 0 
and a multiplicative identity element 1, such that a + 0 = a and a · 1 = a
for all a in S. Subtraction is the inverse of addition, but we assume nothing
about the possibility of division as an inverse to multiplication. The polynomial
0x^{n+m} + ⋯ + 0x^{n+1} + u_n x^n + ⋯ + u_1 x + u_0 is regarded as the same polynomial
as (1), although its expression is formally different.

We say that (1) is a polynomial of degree n and leading coefficient u_n if
u_n ≠ 0; and in this case we write

    deg(u) = n,        ℓ(u) = u_n.        (2)

By convention, we also set

    deg(0) = −∞,        ℓ(0) = 0,        (3)

where “0” denotes the zero polynomial whose coefficients are all zero. We say
that u(x) is a monic polynomial if its leading coefficient ℓ(u) is 1.

Arithmetic on polynomials consists primarily of addition, subtraction, and 
multiplication; in some cases, further operations such as division, exponentiation, 
factoring, and taking the greatest common divisor are important. Addition, 
subtraction, and multiplication are defined in a natural way, as though the 
variable x were an element of S: We add or subtract polynomials by adding or 
subtracting the coefficients of like powers of x. Multiplication is done by the rule 


    (u_r x^r + ⋯ + u_0)(v_s x^s + ⋯ + v_0) = w_{r+s} x^{r+s} + ⋯ + w_0,

where

    w_k = u_0 v_k + u_1 v_{k−1} + ⋯ + u_{k−1} v_1 + u_k v_0.        (4)

In the latter formula u_i or v_j are treated as zero if i > r or j > s.
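
Rule (4) translates almost directly into a program. The following Python sketch is an illustration only and is not part of the original text: the name poly_mul, the convention that u[i] holds the coefficient of x^i, and the optional modulus argument are all assumptions made for the example.

```python
def poly_mul(u, v, modulus=None):
    """Multiply two polynomials by rule (4).

    u[i] and v[j] are the coefficients of x**i and x**j (lowest power first).
    If modulus is given, every product coefficient is reduced modulo it.
    """
    if not u or not v:
        return []                      # the zero polynomial
    w = [0] * (len(u) + len(v) - 1)    # degrees add, so r + s + 1 coefficients
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            w[i + j] += ui * vj        # u_i v_j contributes to w_{i+j}
    if modulus is not None:
        w = [c % modulus for c in w]
    return w


# (2x + 3)(x^2 - 1) = 2x^3 + 3x^2 - 2x - 3
print(poly_mul([3, 2], [-1, 0, 1]))       # [-3, -2, 3, 2]
print(poly_mul([3, 2], [-1, 0, 1], 5))    # [2, 3, 3, 2]
```

The double loop never needs the convention that missing coefficients are zero, because it only visits index pairs that exist.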

The algebraic system S is usually the set of integers, or the rational numbers;
or it may itself be a set of polynomials (in variables other than x), in which case
(1) is a multivariate polynomial, a polynomial in several variables. Another
important case occurs when the algebraic system S consists of the integers 0,
1, …, m−1, with addition, subtraction, and multiplication performed mod m
(see Eq. 4.3.2-(11)); this is called polynomial arithmetic modulo m. Polynomial
arithmetic modulo 2, when each of the coefficients is 0 or 1, is especially
important.

The reader should note the similarity between polynomial arithmetic and 
multiple-precision arithmetic (Section 4.3.1), where the radix b is substituted 
for x. The chief difference is that the coefficient u_k of x^k in polynomial arithmetic
bears no essential relation to its neighboring coefficients u_{k±1}, so the idea of
“carrying” from one place to the next is absent. In fact, polynomial arithmetic 
modulo b is essentially identical to multiple-precision arithmetic with radix b, 
except that all carries are suppressed. For example, compare the multiplication 
of (1101)_2 by (1011)_2 in the binary number system with the analogous multiplication
of x^3 + x^2 + 1 by x^3 + x + 1 modulo 2:


    Binary system        Polynomials modulo 2
          1101                    1101
        × 1011                  × 1011
        ------                  ------
          1101                    1101
         1101                    1101
       1101                    1101
      --------                 -------
      10001111                 1111111


The product of these polynomials modulo 2 is obtained by suppressing all carries,
and it is x^6 + x^5 + x^4 + x^3 + x^2 + x + 1. If we had multiplied the same polynomials
over the integers, without taking residues modulo 2, the result would have been
x^6 + x^5 + x^4 + 3x^3 + x^2 + x + 1; again carries are suppressed, but in this case
the coefficients can get arbitrarily large. 
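
The comparison above can be checked mechanically. In the following Python sketch (an illustration with assumed names, not taken from the text), a polynomial modulo 2 is packed into an integer whose bit k is the coefficient of x^k, so that addition modulo 2 becomes exclusive-or and the shifted partial products are combined without carries.

```python
def clmul(a, b):
    """Carry-less product of nonnegative integers a and b,
    each viewed as a polynomial modulo 2 (bit k = coefficient of x**k)."""
    result = 0
    while b:
        if b & 1:          # this power of x is present in b
            result ^= a    # add the shifted copy of a, without carries
        a <<= 1            # multiply a by x
        b >>= 1
    return result


print(bin(0b1101 * 0b1011))        # 0b10001111  (ordinary binary product)
print(bin(clmul(0b1101, 0b1011)))  # 0b1111111   (product modulo 2)
```

The only difference between the two products is the replacement of addition by exclusive-or.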

In view of this strong analogy with multiple-precision arithmetic, it is unnec- 
essary to discuss polynomial addition, subtraction, and multiplication much fur- 
ther in this section. However, we should point out some aspects that often make 
polynomial arithmetic somewhat different from multiple-precision arithmetic in 
practice: There is often a tendency to have a large number of zero coefficients, 
and polynomials of huge degrees, so special forms of representation are desirable; 
see Section 2.2.4. Furthermore, arithmetic on polynomials in several variables 
leads to routines that are best understood in a recursive framework; this situation 
is discussed in Chapter 8. 

Although the techniques of polynomial addition, subtraction, and multi- 
plication are comparatively straightforward, several other important aspects of 
polynomial arithmetic deserve special examination. The following subsections 
therefore discuss division of polynomials, with associated techniques such as 
finding greatest common divisors and factoring. We shall also discuss the prob- 
lem of efficient evaluation of polynomials, namely the task of finding the value 
of u(x) when x is a given element of S, using as few operations as possible. The
special case of evaluating x^n rapidly when n is large is quite important by itself,
so it is discussed in detail in Section 4.6.3. 

The first major set of computer subroutines for doing polynomial arithmetic 
was the ALPAK system [W. S. Brown, J. P. Hyde, and B. A. Tague, Bell System
Tech. J. 42 (1963), 2081-2119; 43 (1964), 785-804, 1547-1562]. Another early
landmark in this field was the PM system of George Collins [CACM 9 (1966), 
578-589]; see also C. L. Hamblin, Comp. J. 10 (1967), 168-171. 


EXERCISES 

1. [10] If we are doing polynomial arithmetic modulo 10, what is 7x + 2 minus x^2 + 5?
What is 6x^2 + x + 3 times 5x^2 + 2?

2. [17] True or false: (a) The product of monic polynomials is monic. (b) The 
product of polynomials of degrees m and n has degree m+n. (c) The sum of polynomials 
of degrees m and n has degree max(m, n). 


3. [M20] If each of the coefficients u_r, …, u_0, v_s, …, v_0 in (4) is an integer satisfying
the conditions |u_i| ≤ m_1, |v_j| ≤ m_2, what is the maximum absolute value of the product
coefficients w_k?


4. [21] Can the multiplication of polynomials modulo 2 be facilitated by using the 
ordinary arithmetic operations on a binary computer, if coefficients are packed into 
computer words? 


5. [M21] Show how to multiply two polynomials of degree ≤ n, modulo 2, with
an execution time proportional to O(n^{lg 3}) when n is large, by adapting Karatsuba’s
method (see Section 4.3.3). 


4.6.1. Division of Polynomials 


It is possible to divide one polynomial by another in essentially the same way 
that we divide one multiple-precision integer by another, when arithmetic is 
being done on polynomials over a field. A field S is a commutative ring with 
identity, in which exact division is possible as well as the operations of addition, 
subtraction, and multiplication; this means as usual that whenever u and v are 
elements of S, and v ≠ 0, there is an element w in S such that u = vw. The
most important fields of coefficients that arise in applications are 


a) the rational numbers (represented as fractions, see Section 4.5.1); 

b) the real or complex numbers (represented within a computer by means of 
floating point approximations; see Section 4.2); 

c) the integers modulo p where p is prime (where division can be implemented 
as suggested in exercise 4.5.2—16); 

d) rational functions over a field, that is, quotients of two polynomials whose 
coefficients are in that field, the denominator being monic. 


Of special importance is the field of integers modulo 2, whose only elements 
are 0 and 1. Polynomials over this field (namely polynomials modulo 2) have 
many analogies to integers expressed in binary notation; and rational functions 
over this field have striking analogies to rational numbers whose numerator and 
denominator are represented in binary notation. 

Given two polynomials u(x) and v(x) over a field, with v(x) ≠ 0, we can
divide u(x) by v(x) to obtain a quotient polynomial q(x) and a remainder
polynomial r(x) satisfying the conditions

    u(x) = q(x) · v(x) + r(x),        deg(r) < deg(v).        (1)

It is easy to see that there is at most one pair of polynomials (q(x), r(x))
satisfying these relations; for if (q_1(x), r_1(x)) and (q_2(x), r_2(x)) both satisfy (1)
with respect to the same polynomials u(x) and v(x), then q_1(x)v(x) + r_1(x) =
q_2(x)v(x) + r_2(x), so (q_1(x) − q_2(x))v(x) = r_2(x) − r_1(x). Now if q_1(x) − q_2(x) is
nonzero, we have deg((q_1 − q_2) · v) = deg(q_1 − q_2) + deg(v) ≥ deg(v) > deg(r_2 − r_1),
a contradiction; hence q_1(x) − q_2(x) = 0 and r_1(x) = r_2(x).

The following algorithm, which is essentially the same as Algorithm 4.3.1D 
for multiple-precision division but without any concerns of carries, may be used 
to determine g(x) and r(x): 


Algorithm D (Division of polynomials over a field). Given polynomials

    u(x) = u_m x^m + ⋯ + u_1 x + u_0,        v(x) = v_n x^n + ⋯ + v_1 x + v_0

over a field S, where v_n ≠ 0 and m ≥ n ≥ 0, this algorithm finds the polynomials

    q(x) = q_{m−n} x^{m−n} + ⋯ + q_0,        r(x) = r_{n−1} x^{n−1} + ⋯ + r_0

over S that satisfy (1).

D1. [Iterate on k.] Do step D2 for k = m−n, m−n−1, …, 0; then terminate
the algorithm with (r_{n−1}, …, r_0) = (u_{n−1}, …, u_0).

D2. [Division loop.] Set q_k ← u_{n+k}/v_n, and then set u_j ← u_j − q_k v_{j−k} for
j = n+k−1, n+k−2, …, k. (The latter operation amounts to replacing
u(x) by u(x) − q_k x^k v(x), a polynomial of degree < n+k.) ▮


An example of Algorithm D appears below in (5). The number of arithmetic 
operations is essentially proportional to n(m—n+1). Note that explicit division 
of coefficients is done only at the beginning of step D2, and the divisor is 
always v_n; thus if v(x) is a monic polynomial (with v_n = 1), there is no actual
division at all. If multiplication is easier to perform than division it will be
preferable to compute 1/v_n at the beginning of the algorithm and to multiply
by this quantity in step D2. 

We shall often write u(x) mod v(x) for the remainder r(x) in (1). 
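
As a concrete illustration, here is a minimal Python sketch of Algorithm D for the field of integers modulo a prime p. It is not part of the original text; the function name poly_divmod and the list convention (u[j] is the coefficient of x^j) are assumptions. It follows the suggestion above of computing 1/v_n once, here by means of Python's built-in pow with exponent −1 (available in Python 3.8 and later).

```python
def poly_divmod(u, v, p):
    """Algorithm D over the integers modulo a prime p.

    u and v list coefficients lowest power first; v[-1] must be nonzero
    mod p and len(u) >= len(v).  Returns (q, r) with u = q*v + r (mod p).
    """
    u = [c % p for c in u]
    v = [c % p for c in v]
    m, n = len(u) - 1, len(v) - 1
    inv = pow(v[n], -1, p)                  # 1/v_n, computed once
    q = [0] * (m - n + 1)
    for k in range(m - n, -1, -1):          # step D1
        q[k] = u[n + k] * inv % p           # step D2: q_k <- u_{n+k}/v_n
        for j in range(n + k - 1, k - 1, -1):
            u[j] = (u[j] - q[k] * v[j - k]) % p
    return q, u[:n]                         # (r_{n-1}, ..., r_0) = (u_{n-1}, ..., u_0)


# x^8 + x^6 + 10x^4 + 10x^3 + 8x^2 + 2x + 8 divided by
# 3x^6 + 5x^4 + 9x^2 + 4x + 8, modulo 13
q, r = poly_divmod([8, 2, 8, 10, 10, 0, 1, 0, 1], [8, 4, 9, 0, 5, 0, 3], 13)
print(q)   # [7, 0, 9]           i.e. 9x^2 + 7
print(r)   # [4, 0, 3, 0, 11, 0] i.e. 11x^4 + 3x^2 + 4
```

The remainder is returned with exactly n coefficients, so high-order zeros (as in this example) are not stripped.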


Unique factorization domains. If we restrict consideration to polynomials 
over a field, we are not coming to grips with many important cases, such as 
polynomials over the integers or polynomials in several variables. Let us therefore 
now consider the more general situation that the algebraic system S of coefficients 
is a unique factorization domain, not necessarily a field. This means that S is a
commutative ring with identity, and that

i) uv ≠ 0, whenever u and v are nonzero elements of S;
ii) every nonzero element u of S is either a unit or has a “unique” representation
as a product of primes p_1, …, p_t:

    u = p_1 … p_t,        t ≥ 1.        (2)

A unit is an element that has a reciprocal, namely an element u such that uv = 1
for some v in S; and a prime is a nonunit element p such that the equation p = qr
can be true only if either q or r is a unit. The representation (2) is to be unique
in the sense that if p_1 … p_t = q_1 … q_s, where all the p’s and q’s are primes, then
s = t and there is a permutation π_1 … π_t of {1, …, t} such that p_1 = a_1 q_{π_1}, …,
p_t = a_t q_{π_t}, for some units a_1, …, a_t. In other words, factorization into primes
is unique, except for unit multiples and except for the order of the factors.

Any field is a unique factorization domain, in which each nonzero element is 
a unit and there are no primes. The integers form a unique factorization domain 
in which the units are +1 and −1, and the primes are ±2, ±3, ±5, ±7, ±11, etc.
The case that S is the set of all integers is of principal importance, because it 
is often preferable to work with integer coefficients instead of arbitrary rational 
coefficients. 


One of the key facts about polynomials (see exercise 10) is that the poly- 
nomials over a unique factorization domain form a unique factorization domain. 
A polynomial that is prime in this domain is usually called an irreducible polyno- 
mial. By using the unique factorization theorem repeatedly, we can prove that 
multivariate polynomials over the integers, or over any field, in any number of 
variables, can be uniquely factored into irreducible polynomials. For example, 
the multivariate polynomial 90x^3 − 120x^2 y + 18x^2 yz − 24xy^2 z over the integers
is the product of five irreducible polynomials 2 · 3 · x · (3x − 4y) · (5x + yz). The
same polynomial, as a polynomial over the rationals, is the product of three
irreducible polynomials (6x) · (3x − 4y) · (5x + yz); this factorization can also be
written x · (90x − 120y) · (x + (1/5)yz) and in infinitely many other ways, although
the factorization is essentially unique. 

As usual, we say that u(x) is a multiple of v(x), and that v(x) is a divisor 
of u(x), if u(x) = v(a)q(x) for some polynomial q(x). If we have an algorithm to 
tell whether or not u is a multiple of v for arbitrary nonzero elements u and v of a 
unique factorization domain S, and to determine w if u = v-w, then Algorithm D 
gives us a method to tell whether or not u(x) is a multiple of v(x) for arbitrary 
nonzero polynomials u(x) and v(x) over S. For if u(x) is a multiple of v(x), it is 
easy to see that u_{n+k} must be a multiple of v_n each time we get to step D2, hence
the quotient u(x)/v(x) will be found. Applying this observation recursively, we 
obtain an algorithm that decides if a given polynomial over S, in any number of 
variables, is a multiple of another given polynomial over S, and the algorithm 
will find the quotient when it exists. 

A set of elements of a unique factorization domain is said to be relatively 
prime if no prime of that unique factorization domain divides all of them. A 
polynomial over a unique factorization domain is called primitive if its coefficients 
are relatively prime. (This concept should not be confused with the quite 
different idea of “primitive polynomials modulo p” discussed in Section 3.2.2.) 
The following fact, introduced for the case of polynomials over the integers by 
C. F. Gauss in article 42 of his celebrated book Disquisitiones Arithmeticæ
(Leipzig: 1801), is of prime importance: 


Lemma G (Gauss’s Lemma). The product of primitive polynomials over a
unique factorization domain is primitive.

Proof. Let u(x) = u_m x^m + ⋯ + u_0 and v(x) = v_n x^n + ⋯ + v_0 be primitive
polynomials. If p is any prime of the domain, we must show that p does not
divide all the coefficients of u(x)v(x). By assumption, there is an index j such
that u_j is not divisible by p, and an index k such that v_k is not divisible by p.
Let j and k be as small as possible; then the coefficient of x^{j+k} in u(x)v(x) is

    u_j v_k + u_{j+1} v_{k−1} + ⋯ + u_{j+k} v_0 + u_{j−1} v_{k+1} + ⋯ + u_0 v_{k+j},

and it is easy to see that this is not a multiple of p (since its first term isn’t, but
all of its other terms are). ▮


If a nonzero polynomial u(x) over a unique factorization domain S is not 
primitive, we can write u(x) = p_1 · u_1(x), where p_1 is a prime of S dividing all
the coefficients of u(x), and where u_1(x) is another nonzero polynomial over S.
All of the coefficients of u_1(x) have one less prime factor than the corresponding
coefficients of u(x). Now if u_1(x) is not primitive, we can write u_1(x) = p_2 · u_2(x),
etc.; this process must ultimately terminate in a representation u(x) = c · u_k(x),
where c is an element of S and u_k(x) is primitive. In fact, we have the following
companion to Lemma G: 


Lemma H. Any nonzero polynomial u(x) over a unique factorization domain S
can be factored in the form u(x) = c · v(x), where c is in S and v(x) is primitive.
Furthermore, this representation is unique, in the sense that if u(x) = c_1 · v_1(x) =
c_2 · v_2(x), then c_1 = a c_2 and v_2(x) = a v_1(x) where a is a unit of S.


Proof. We have shown that such a representation exists, so only the uniqueness
needs to be proved. Assume that c_1 · v_1(x) = c_2 · v_2(x), where v_1(x) and v_2(x)
are primitive. Let p be any prime of S. If p^k divides c_1, then p^k also divides c_2;
otherwise p^k would divide all the coefficients of c_2 · v_2(x), so p would divide
all the coefficients of v_2(x), a contradiction. Similarly, p^k divides c_2 only if p^k
divides c_1. Hence, by unique factorization, c_1 = a c_2 where a is a unit; and
0 = a c_2 · v_1(x) − c_2 · v_2(x) = c_2 · (a v_1(x) − v_2(x)), so a v_1(x) − v_2(x) = 0. ▮


Therefore we may write any nonzero polynomial u(x) as

    u(x) = cont(u) · pp(u(x)),        (3)

where cont(u), the content of u, is an element of S, and pp(u(x)), the primitive
part of u(x), is a primitive polynomial over S. When u(x) = 0, it is convenient
to define cont(u) = pp(u(x)) = 0. Combining Lemmas G and H gives us the
relations

    cont(u · v) = a cont(u) cont(v),
    pp(u(x) · v(x)) = b pp(u(x)) pp(v(x)),        (4)

where a and b are units, depending on the way contents are calculated, with 
ab = 1. When we are working with polynomials over the integers, the only units 
are +1 and —1, and it is conventional to define pp(u(x)) so that its leading 
coefficient is positive; then (4) is true with a = b = 1. When working with 
polynomials over a field we may take cont(u) = ℓ(u), so that pp(u(x)) is monic;
in this case again (4) holds with a = b = 1, for all u(x) and v(x).

For example, if we are dealing with polynomials over the integers, let u(x) =
−26x^2 + 39 and v(x) = 21x + 14. Then

    cont(u) = −13,          pp(u(x)) = 2x^2 − 3,
    cont(v) = +7,           pp(v(x)) = 3x + 2,
    cont(u · v) = −91,      pp(u(x) · v(x)) = 6x^3 + 4x^2 − 9x − 6.
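
The content and primitive part of a polynomial over the integers are easy to compute. The following Python sketch is an illustration only (the names content and primitive_part, and the lowest-power-first list convention, are assumptions); it adopts the convention mentioned above that the primitive part has a positive leading coefficient, and it assumes that the last list entry of a nonzero polynomial is nonzero.

```python
from functools import reduce
from math import gcd


def content(u):
    """cont(u) for an integer polynomial u (lowest power first).
    The sign is chosen so that the primitive part has a positive
    leading coefficient; the content of the zero polynomial is 0."""
    g = reduce(gcd, u, 0)                 # gcd of all coefficients, >= 0
    return -g if u and u[-1] < 0 else g


def primitive_part(u):
    """pp(u(x)); the zero polynomial is returned unchanged."""
    c = content(u)
    return [a // c for a in u] if c else list(u)


u = [39, 0, -26]                          # -26x^2 + 39
v = [14, 21]                              # 21x + 14
uv = [546, 819, -364, -546]               # their product over the integers
print(content(u), primitive_part(u))      # -13 [-3, 0, 2]
print(content(v), primitive_part(v))      # 7 [2, 3]
print(content(uv), primitive_part(uv))    # -91 [-6, -9, 4, 6]
```

With this sign convention the relations (4) hold with a = b = 1 in the example: (−13) · 7 = −91, and (2x^2 − 3)(3x + 2) = 6x^3 + 4x^2 − 9x − 6.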


Greatest common divisors. When there is unique factorization, it makes 
sense to speak of a greatest common divisor of two elements; this is a common 
divisor that is divisible by as many primes as possible. (See Eq. 4.5.2-(6).) Since 
a unique factorization domain may have many units, however, there is ambiguity 
in this definition of greatest common divisor; if w is a greatest common divisor 
of u and v, so is a- w, when a is any unit. Conversely, the assumption of unique 
factorization implies that if w_1 and w_2 are both greatest common divisors of u
and v, then w_1 = a · w_2 for some unit a. In other words it does not make sense,
in general, to speak of “the” greatest common divisor of u and v; there is a set 
of greatest common divisors, each one being a unit multiple of the others. 

Let us now consider the problem of finding a greatest common divisor of 
two given polynomials over an algebraic system S, a question originally raised by 
Pedro Nuñez in his Libro de Algebra (Antwerp: 1567). If S is a field, the problem
is relatively simple; our division algorithm, Algorithm D, can be extended to an 
algorithm that computes greatest common divisors, just as Euclid’s algorithm 
(Algorithm 4.5.2A) yields the greatest common divisor of two given integers 
based on a division algorithm for integers: 


    If v(x) = 0, then gcd(u(x), v(x)) = u(x);
    otherwise gcd(u(x), v(x)) = gcd(v(x), r(x)),


where r(x) is given by (1). This procedure is called Euclid’s algorithm for 
polynomials over a field. It was first used by Simon Stevin in L’Arithmetique 
(Leiden: 1585); see A. Girard, Les Œuvres Mathématiques de Simon Stevin 1 
(Leiden: 1634), 56. 

For example, let us determine the gcd of x^8 + x^6 + 10x^4 + 10x^3 + 8x^2 + 2x + 8
and 3x^6 + 5x^4 + 9x^2 + 4x + 8, mod 13, by using Euclid’s algorithm for polynomials
over the integers modulo 13. First, writing only the coefficients to show the steps
of Algorithm D, we have


                                       9  0  7
    3 0 5 0 9 4 8 )  1  0  1  0 10 10  8  2  8
                     1  0  6  0  3 10  7
                        0  8  0  7  0  1  2  8        (5)
                           8  0  9  0 11  2  4
                              0 11  0  3  0  4


so that x^8 + x^6 + 10x^4 + 10x^3 + 8x^2 + 2x + 8 equals

    (9x^2 + 7)(3x^6 + 5x^4 + 9x^2 + 4x + 8) + (11x^4 + 3x^2 + 4).

Similarly,

    3x^6 + 5x^4 + 9x^2 + 4x + 8 = (5x^2 + 5)(11x^4 + 3x^2 + 4) + (4x + 1);
    11x^4 + 3x^2 + 4 = (6x^3 + 5x^2 + 6x + 5)(4x + 1) + 12;
    4x + 1 = (9x + 12) · 12 + 0.        (6)


(The equality sign here means congruence modulo 13, since all arithmetic on 
the coefficients has been done mod 13.) This computation shows that 12 is 
a greatest common divisor of the two original polynomials. Now any nonzero 
element of a field is a unit of the domain of polynomials over that field, so 
it is conventional in the case of fields to divide the result of the algorithm by 
its leading coefficient, producing a monic polynomial that is called the greatest 
common divisor of the two given polynomials. The gcd computed in (6) is 
accordingly taken to be 1, not 12. The last step in (6) could have been omitted, 
for if deg(v) = 0, then gcd(u(x), v(x)) = 1, no matter what polynomial is chosen
for u(x). Exercise 4 determines the average running time for Euclid’s algorithm 
on random polynomials modulo p. 
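
Euclid's algorithm for polynomials modulo a prime p can be sketched in a few lines of Python. This sketch is not from the original text; the helper names trim, poly_rem, and poly_gcd_mod_p and the lowest-power-first coefficient lists are assumptions made for illustration. The result is normalized to be monic, following the convention just described.

```python
def trim(u):
    """Drop zero coefficients of the highest powers; [] is the zero polynomial."""
    u = list(u)
    while u and u[-1] == 0:
        u.pop()
    return u


def poly_rem(u, v, p):
    """u(x) mod v(x) over the integers mod p, by Algorithm D.
    Both lists are already reduced mod p and v is nonzero."""
    u = list(u)
    n = len(v) - 1
    inv = pow(v[n], -1, p)                      # 1/v_n (Python 3.8+)
    for k in range(len(u) - 1 - n, -1, -1):
        q_k = u[n + k] * inv % p
        for j in range(n + k, k - 1, -1):
            u[j] = (u[j] - q_k * v[j - k]) % p
    return trim(u[:n])


def poly_gcd_mod_p(u, v, p):
    """Monic greatest common divisor of u(x) and v(x) modulo the prime p."""
    u = trim([c % p for c in u])
    v = trim([c % p for c in v])
    while v:                                    # gcd(u, v) = gcd(v, u mod v)
        u, v = v, poly_rem(u, v, p)
    if not u:
        return []                               # gcd(0, 0) = 0
    inv = pow(u[-1], -1, p)
    return [c * inv % p for c in u]             # divide by the leading coefficient


# The two polynomials of the example above, modulo 13
print(poly_gcd_mod_p([8, 2, 8, 10, 10, 0, 1, 0, 1],
                     [8, 4, 9, 0, 5, 0, 3], 13))          # [1]
# (x + 1)(x + 2) and (x + 1)(x + 3) share the factor x + 1
print(poly_gcd_mod_p([2, 3, 1], [3, 4, 1], 13))           # [1, 1]
```

The first call passes through the same remainders 11x^4 + 3x^2 + 4, 4x + 1, and 12 that appear in (5) and (6) before normalizing 12 to 1.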

Let us now turn to the more general situation in which our polynomials are 
given over a unique factorization domain that is not a field. From Eqs. (4) we 
can deduce the important relations 


    cont(gcd(u, v)) = a · gcd(cont(u), cont(v)),
    pp(gcd(u(x), v(x))) = b · gcd(pp(u(x)), pp(v(x))),        (7)


where a and b are units. Here gcd(u(x), v(x)) denotes any particular polynomial
in x that is a greatest common divisor of u(x) and v(x). Equations (7) reduce
the problem of finding greatest common divisors of arbitrary polynomials to the
problem of finding greatest common divisors of primitive polynomials.

Algorithm D for division of polynomials over a field can be generalized to a
pseudo-division of polynomials over any algebraic system that is a commutative
ring with identity. We can observe that Algorithm D requires explicit division
only by ℓ(v), the leading coefficient of v(x), and that step D2 is carried out
exactly m − n + 1 times; thus if u(x) and v(x) start with integer coefficients,
and if we are working over the rational numbers, then the only denominators
that appear in the coefficients of q(x) and r(x) are divisors of ℓ(v)^{m−n+1}. This
suggests that we can always find polynomials q(x) and r(x) such that

    ℓ(v)^{m−n+1} u(x) = q(x)v(x) + r(x),        deg(r) < n,        (8)

where m = deg(u) and n = deg(v), for any polynomials u(x) and v(x) ≠ 0,
provided that m ≥ n.


Algorithm R (Pseudo-division of polynomials). Given polynomials

    u(x) = u_m x^m + ⋯ + u_1 x + u_0,        v(x) = v_n x^n + ⋯ + v_1 x + v_0,

where v_n ≠ 0 and m ≥ n ≥ 0, this algorithm finds polynomials q(x) =
q_{m−n} x^{m−n} + ⋯ + q_0 and r(x) = r_{n−1} x^{n−1} + ⋯ + r_0 satisfying (8).

R1. [Iterate on k.] Do step R2 for k = m−n, m−n−1, …, 0; then terminate
the algorithm with (r_{n−1}, …, r_0) = (u_{n−1}, …, u_0).

R2. [Multiplication loop.] Set q_k ← u_{n+k} v_n^k, and set u_j ← v_n u_j − u_{n+k} v_{j−k} for
j = n+k−1, n+k−2, …, 0. (When j < k this means that u_j ← v_n u_j,
since we treat v_{−1}, v_{−2}, … as zero. These multiplications could have been
avoided if we had started the algorithm by replacing u_t by v_n^{m−n−t} u_t, for
0 ≤ t < m−n.) ▮


An example calculation appears below in (10). It is easy to prove the validity 
of Algorithm R by induction on m—n, since each execution of step R2 essentially 
replaces u(x) by ℓ(v)u(x) − ℓ(u)x^k v(x), where k = deg(u) − deg(v). Notice that
no division whatever is used in this algorithm; the coefficients of q(x) and r(x)
are themselves certain polynomial functions of the coefficients of u(x) and v(x).
If v_n = 1, the algorithm is identical to Algorithm D. If u(x) and v(x) are
polynomials over a unique factorization domain, we can prove as before that the
polynomials q(x) and r(x) are unique; therefore another way to do the
pseudo-division over a unique factorization domain is to multiply u(x) by v_n^{m−n+1} and
apply Algorithm D, knowing that all the quotients in step D2 will exist. 
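
A minimal Python sketch of Algorithm R over the integers follows; it is an illustration rather than part of the original text, and the name pseudo_divmod and the lowest-power-first list convention are assumptions. Note that the index j − k must be guarded explicitly, since a negative list index in Python would wrap around instead of denoting the zero coefficients v_{−1}, v_{−2}, ….

```python
def pseudo_divmod(u, v):
    """Algorithm R: return (q, r) with
    v_n**(m-n+1) * u(x) = q(x)*v(x) + r(x) and deg(r) < n.

    u and v are integer coefficient lists, lowest power first,
    with v[-1] != 0 and len(u) >= len(v).  No division is performed.
    """
    u = list(u)
    m, n = len(u) - 1, len(v) - 1
    q = [0] * (m - n + 1)
    for k in range(m - n, -1, -1):               # step R1
        lead = u[n + k]
        q[k] = lead * v[n] ** k                  # step R2: q_k <- u_{n+k} v_n^k
        for j in range(n + k - 1, -1, -1):
            v_jk = v[j - k] if j >= k else 0     # v_{-1}, v_{-2}, ... are zero
            u[j] = v[n] * u[j] - lead * v_jk
    return q, u[:n]


# u(x) = x^8 + x^6 - 3x^4 - 3x^3 + 8x^2 + 2x - 5
# v(x) = 3x^6 + 5x^4 - 4x^2 - 9x + 21
q, r = pseudo_divmod([-5, 2, 8, -3, -3, 0, 1, 0, 1],
                     [21, -9, -4, 0, 5, 0, 3])
print(q)   # [-6, 0, 9]              i.e. 9x^2 - 6
print(r)   # [-9, 0, 3, 0, -15, 0]   i.e. -15x^4 + 3x^2 - 9
```

Here m − n + 1 = 3, so the identity being verified is 27u(x) = (9x^2 − 6)v(x) + (−15x^4 + 3x^2 − 9).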


Algorithm R can be extended to a “generalized Euclidean algorithm” for 
primitive polynomials over a unique factorization domain, in the following way: 
Let u(x) and v(x) be primitive polynomials with deg(u) > deg(v), and determine 
the polynomial r(x) satisfying (8) by means of Algorithm R. Now we can prove 
that gcd(u(x), v(x)) = gcd(v(x),r(x)): Any common divisor of u(x) and v(x) 
divides v(x) and r(x); conversely, any common divisor of v(x) and r(x) divides
ℓ(v)^{m−n+1} u(x), and it must be primitive (since v(x) is primitive) so it divides
u(x). If r(x) = 0, we therefore have gcd(u(x), v(x)) = v(x); on the other hand if
r(x) ≠ 0, we have gcd(v(x), r(x)) = gcd(v(x), pp(r(x))) since v(x) is primitive,
so the process can be iterated. 


Algorithm E (Generalized Euclidean algorithm). Given nonzero polynomials 
u(x) and v(x) over a unique factorization domain S, this algorithm calculates a 
greatest common divisor of u(x) and u(x). We assume that auxiliary algorithms 
exist to calculate greatest common divisors of elements of S, and to divide a by b 
in S when b 4 0 and ais a multiple of b. 


E1. [Reduce to primitive.] Set d ← gcd(cont(u), cont(v)), using the assumed
algorithm for calculating greatest common divisors in S. (By definition,
cont(u) is a greatest common divisor of the coefficients of u(x).) Replace
u(x) by the polynomial u(x)/cont(u) = pp(u(x)); similarly, replace v(x)
by pp(v(x)).

E2. [Pseudo-division.] Calculate r(x) using Algorithm R. (It is unnecessary to
calculate the quotient polynomial q(x).) If r(x) = 0, go to E4. If deg(r) = 0,
replace v(x) by the constant polynomial “1” and go to E4.

E3. [Make remainder primitive.] Replace u(x) by v(x) and replace v(x) by
pp(r(x)). Go back to step E2. (This is the “Euclidean step,” analogous
to the other instances of Euclid’s algorithm that we have seen.)

E4. [Attach the content.] The algorithm terminates, with d · v(x) as the desired
answer. ∎

As an example of Algorithm E, let us calculate the gcd of the polynomials

$$\begin{aligned}
u(x) &= x^8 + x^6 - 3x^4 - 3x^3 + 8x^2 + 2x - 5,\\
v(x) &= 3x^6 + 5x^4 - 4x^2 - 9x + 21,
\end{aligned} \tag{9}$$

over the integers. These polynomials are primitive, so step E1 sets d ← 1. In
step E2 we have the pseudo-division


                                                    1    0   −6
3 0 5 0 −4 −9 21 )    1    0    1    0   −3   −3    8    2   −5
                      3    0    3    0   −9   −9   24    6  −15
                      3    0    5    0   −4   −9   21
                           0   −2    0   −5    0    3    6  −15
                           0   −6    0  −15    0    9   18  −45
                           0    0    0    0    0    0    0             (10)
                               −6    0  −15    0    9   18  −45
                              −18    0  −45    0   27   54 −135
                              −18    0  −30    0   24   54 −126
                                        −15    0    3    0   −9

Here the quotient q(x) is $1\cdot3^2x^2 + 0\cdot3^1x + (-6)\cdot3^0$; we have

$$27\,u(x) = v(x)\,(9x^2 - 6) + (-15x^4 + 3x^2 - 9). \tag{11}$$


Now step E3 replaces u(x) by v(x) and v(x) by pp(r(x)) = $5x^4 - x^2 + 3$. The
subsequent calculation is summarized in the following table, where only the 
coefficients are shown: 


u(x)                            v(x)                      r(x)
1, 0, 1, 0, −3, −3, 8, 2, −5    3, 0, 5, 0, −4, −9, 21    −15, 0, 3, 0, −9
3, 0, 5, 0, −4, −9, 21          5, 0, −1, 0, 3            −585, −1125, 2205
5, 0, −1, 0, 3                  13, 25, −49               −233150, 307500
13, 25, −49                     4663, −6150               143193869           (12)
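The steps of Algorithm E over the integers can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, polynomials are integer coefficient lists with the highest degree first, deg(u) ≥ deg(v) is assumed, and the primitive part is normalized to a positive leading coefficient (an assumed convention, chosen because it reproduces the signs shown in (12)).

```python
from functools import reduce
from math import gcd


def pseudo_remainder(u, v):
    """Remainder r of pseudo-division: lead(v)**(m-n+1) * u = q*v + r."""
    r = list(u)
    for _ in range(len(u) - len(v) + 1):
        c = r[0]
        r = [v[0] * a for a in r]
        for i, b in enumerate(v):
            r[i] -= c * b
        r.pop(0)                       # leading term has cancelled
    while r and r[0] == 0:             # strip leading zeros; [] means r(x) = 0
        r.pop(0)
    return r


def content(p):
    return reduce(gcd, p, 0)           # nonnegative gcd of all coefficients


def primitive_part(p):
    c = content(p)
    if p[0] < 0:                       # assumed convention: positive leading coefficient
        c = -c
    return [a // c for a in p]         # exact division


def algorithm_e(u, v):
    """Return (gcd, list of pseudo-remainders seen in step E2)."""
    d = gcd(content(u), content(v))                  # E1
    u, v = primitive_part(u), primitive_part(v)
    remainders = []
    while True:
        r = pseudo_remainder(u, v)                   # E2
        if not r:
            break
        remainders.append(r)
        if len(r) == 1:                              # deg(r) = 0
            v = [1]
            break
        u, v = v, primitive_part(r)                  # E3
    return [d * a for a in v], remainders            # E4


g, remainders = algorithm_e([1, 0, 1, 0, -3, -3, 8, 2, -5],
                            [3, 0, 5, 0, -4, -9, 21])
```

For the polynomials (9) the recorded remainders are exactly the r(x) column of (12), and the returned gcd is the constant polynomial 1.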


It is instructive to compare this calculation with the computation of the 
same greatest common divisor over the rational numbers, instead of over the 
integers, by using Euclid’s algorithm for polynomials over a field as described 
earlier in this section. The following surprisingly complicated sequence appears: 


u(x)                            v(x)
1, 0, 1, 0, −3, −3, 8, 2, −5    3, 0, 5, 0, −4, −9, 21
3, 0, 5, 0, −4, −9, 21          −5/9, 0, 1/9, 0, −1/3
−5/9, 0, 1/9, 0, −1/3           −117/25, −9, 441/25
−117/25, −9, 441/25             233150/19773, −102500/6591
233150/19773, −102500/6591      −1288744821/543589225          (13)




To improve that algorithm, we can reduce u(x) and v(x) to monic polynomials
at each step, since this removes unit factors that make the coefficients more
complicated than necessary; this is actually Algorithm E over the rationals:


u(x)                            v(x)
1, 0, 1, 0, −3, −3, 8, 2, −5    1, 0, 5/3, 0, −4/3, −3, 7
1, 0, 5/3, 0, −4/3, −3, 7       1, 0, −1/5, 0, 3/5
1, 0, −1/5, 0, 3/5              1, 25/13, −49/13
1, 25/13, −49/13                1, −6150/4663
1, −6150/4663                   1                              (14)


In both (13) and (14) the sequence of polynomials is essentially the same 
as (12), which was obtained by Algorithm E over the integers; the only differ- 
ence is that the polynomials have been multiplied by certain rational numbers. 
Whether we have $5x^4 - x^2 + 3$ or $-\frac59x^4 + \frac19x^2 - \frac13$ or $x^4 - \frac15x^2 + \frac35$, the computations
are essentially the same. But either algorithm using rational arithmetic tends 
to run slower than the all-integer Algorithm E, since rational arithmetic usually 
requires more evaluations of integer gcds within each step when the polynomials 
have large degree. 

It is instructive to compare (12), (13), and (14) with (6) above, where we 
determined the gcd of the same polynomials u(x) and v(x) modulo 13 with 
considerably less labor. Since ℓ(u) and ℓ(v) are not multiples of 13, the fact
that gcd(u(x), v(x)) = 1 modulo 13 is sufficient to prove that u(x) and v(x)
are relatively prime over the integers (and therefore over the rational numbers). 
We will return to this time-saving observation at the close of Section 4.6.2. 
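The modular shortcut described above can be tried directly. The following sketch runs Euclid's algorithm for polynomials modulo a prime p; the helper names are hypothetical, coefficient lists are highest degree first, and `pow(x, -1, p)` for the modular inverse assumes Python 3.8 or later.

```python
def strip_zeros(p):
    """Remove leading zero coefficients; [] represents the zero polynomial."""
    i = 0
    while i < len(p) and p[i] == 0:
        i += 1
    return p[i:]


def remainder_mod(u, v, p):
    """Remainder of u(x) divided by v(x), coefficients modulo the prime p."""
    u = list(u)
    inverse = pow(v[0], -1, p)             # 1 / lead(v) modulo p
    while len(u) >= len(v):
        c = u[0] * inverse % p             # next quotient coefficient
        for i, b in enumerate(v):
            u[i] = (u[i] - c * b) % p
        u = strip_zeros(u)                 # leading term is now zero
    return u


def gcd_mod(u, v, p):
    """Monic gcd of u(x) and v(x) modulo the prime p (not both zero)."""
    u = strip_zeros([a % p for a in u])
    v = strip_zeros([a % p for a in v])
    while v:
        u, v = v, remainder_mod(u, v, p)
    inverse = pow(u[0], -1, p)
    return [a * inverse % p for a in u]


U = [1, 0, 1, 0, -3, -3, 8, 2, -5]
V = [3, 0, 5, 0, -4, -9, 21]
g13 = gcd_mod(U, V, 13)
```

Here `g13` is the constant polynomial 1, so by the argument above (neither leading coefficient is a multiple of 13) the two polynomials of (9) are relatively prime over the integers.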


The subresultant algorithm. An ingenious algorithm that is generally supe- 
rior to Algorithm E, and that gives us further information about Algorithm E’s 
behavior, was discovered by George E. Collins [JACM 14 (1967), 128-142] and 
subsequently improved by W. S. Brown and J. F. Traub [JACM 18 (1971), 505- 
514; see also W. S. Brown, ACM Trans. Math. Software 4 (1978), 237-249]. This 
algorithm avoids the calculation of primitive parts in step E3, dividing instead 
by an element of S that is known to be a factor of r(x): 


Algorithm C (Greatest common divisor over a unique factorization domain). 
This algorithm has the same input and output assumptions as Algorithm E, 
and has the advantage that fewer calculations of greatest common divisors of 
coefficients are needed. 


C1. [Reduce to primitive.] As in step E1 of Algorithm E, set d ← gcd(cont(u),
cont(v)), and replace (u(x), v(x)) by (pp(u(x)), pp(v(x))). Set g ← h ← 1.

C2. [Pseudo-division.] Set δ ← deg(u) − deg(v). Calculate r(x) using Algo-
rithm R. If r(x) = 0, go to C4. If deg(r) = 0, replace v(x) by the constant
polynomial “1” and go to C4.

C3. [Adjust remainder.] Replace the polynomial u(x) by v(x), and replace v(x)
by $r(x)/gh^{\delta}$. (At this point all coefficients of r(x) are multiples of $gh^{\delta}$.)
Then set g ← ℓ(u), $h \leftarrow h^{1-\delta}g^{\delta}$ and return to C2. (The new value of h will
be in the domain S, even if δ > 1.)

C4. [Attach the content.] Return d · pp(v(x)) as the answer. ∎


If we apply this algorithm to the polynomials (9) considered earlier, the 
following sequence of results is obtained at the beginning of step C2: 


u(x)                            v(x)                      g      h
1, 0, 1, 0, −3, −3, 8, 2, −5    3, 0, 5, 0, −4, −9, 21    1      1
3, 0, 5, 0, −4, −9, 21          −15, 0, 3, 0, −9          3      9
−15, 0, 3, 0, −9                65, 125, −245             −15    25
65, 125, −245                   −9326, 12300              65     169      (15)

At the conclusion of the algorithm, $r(x)/gh^{\delta} = 260708$.
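Algorithm C over the integers can be sketched in the same style. Again this is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, coefficient lists are highest degree first, deg(u) ≥ deg(v), and the trace list is an addition made only so that the values at the start of each step C2 can be inspected.

```python
from functools import reduce
from math import gcd


def pseudo_remainder(u, v):
    """Remainder r of pseudo-division: lead(v)**(m-n+1) * u = q*v + r."""
    r = list(u)
    for _ in range(len(u) - len(v) + 1):
        c = r[0]
        r = [v[0] * a for a in r]
        for i, b in enumerate(v):
            r[i] -= c * b
        r.pop(0)
    while r and r[0] == 0:
        r.pop(0)
    return r


def content(p):
    return reduce(gcd, p, 0)


def primitive_part(p):
    c = content(p)
    if p[0] < 0:                       # assumed sign convention
        c = -c
    return [a // c for a in p]


def algorithm_c(u, v):
    """Return (gcd, trace); trace holds (u, v, g, h) at each entry to step C2."""
    d = gcd(content(u), content(v))                    # C1
    u, v = primitive_part(u), primitive_part(v)
    g = h = 1
    trace = []
    while True:
        trace.append((list(u), list(v), g, h))
        delta = len(u) - len(v)                        # C2
        r = pseudo_remainder(u, v)
        if not r:
            break
        if len(r) == 1:                                # deg(r) = 0
            v = [1]
            break
        divisor = g * h**delta                         # C3: exact division by g*h^delta
        u, v = v, [a // divisor for a in r]
        g = u[0]
        if delta > 0:
            h = g**delta // h**(delta - 1)             # h <- h^(1-delta) * g^delta
    return [d * a for a in primitive_part(v)], trace   # C4


U = [1, 0, 1, 0, -3, -3, 8, 2, -5]
V = [3, 0, 5, 0, -4, -9, 21]
g_c, trace = algorithm_c(U, V)
```

For the polynomials (9) the trace reproduces the four rows of (15). No gcd of coefficients is computed inside the loop; the only divisions are the exact ones in step C3.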

The sequence of polynomials consists of integral multiples of the polynomials 
in the sequence produced by Algorithm E. In spite of the fact that the polyno- 
mials are not reduced to primitive form, the coefficients are kept to a reasonable 
size because of the reduction factor in step C3. 

In order to analyze Algorithm C and to prove that it is valid, let us call the
sequence of polynomials it produces $u_1(x)$, $u_2(x)$, $u_3(x)$, ..., where $u_1(x) = u(x)$
and $u_2(x) = v(x)$. Let $\delta_j = n_j - n_{j+1}$ for j ≥ 1, where $n_j = \deg(u_j)$; and let
$g_1 = h_1 = 1$, $g_j = \ell(u_j)$, $h_j = h_{j-1}^{1-\delta_{j-1}}g_j^{\delta_{j-1}}$ for j ≥ 2. Then we have

$$\begin{aligned}
g_2^{\delta_1+1}u_1(x) &= u_2(x)q_1(x) + g_1h_1^{\delta_1}u_3(x), &\quad n_3 &< n_2;\\
g_3^{\delta_2+1}u_2(x) &= u_3(x)q_2(x) + g_2h_2^{\delta_2}u_4(x), &\quad n_4 &< n_3;\\
g_4^{\delta_3+1}u_3(x) &= u_4(x)q_3(x) + g_3h_3^{\delta_3}u_5(x), &\quad n_5 &< n_4;
\end{aligned} \tag{16}$$

and so on. The process terminates when $n_{k+1} = \deg(u_{k+1}) \le 0$. We must
show that $u_3(x)$, $u_4(x)$, ..., have coefficients in S, namely that the factors $g_jh_j^{\delta_j}$
exactly divide all coefficients of the remainders, and we must also show that the
$h_j$ values all belong to S. The proof is rather involved, and it can be most easily
understood by considering an example.

Suppose, as in (15), that $n_1 = 8$, $n_2 = 6$, $n_3 = 4$, $n_4 = 2$, $n_5 = 1$, $n_6 = 0$, so
that $\delta_1 = \delta_2 = \delta_3 = 2$, $\delta_4 = \delta_5 = 1$. Let us write $u_1(x) = a_8x^8 + a_7x^7 + \cdots + a_0$,
$u_2(x) = b_6x^6 + b_5x^5 + \cdots + b_0$, ..., $u_5(x) = e_1x + e_0$, $u_6(x) = f_0$, so that $h_1 = 1$,
$h_2 = b_6^2$, $h_3 = c_4^2/b_6^2$, $h_4 = d_2^2b_6^2/c_4^2$. In these terms it is helpful to consider the
array shown in Table 1. For concreteness, let us assume that the coefficients
of the polynomials are integers. We have $b_6^3u_1(x) = u_2(x)q_1(x) + u_3(x)$; so if
we multiply row $A_5$ by $b_6^3$ and subtract appropriate multiples of rows $B_7$, $B_6$,
and $B_5$ (corresponding to the coefficients of $q_1(x)$) we will get row $C_5$. If we also
multiply row $A_4$ by $b_6^3$ and subtract multiples of rows $B_6$, $B_5$, and $B_4$, we get
row $C_4$. In a similar way, we have $c_4^3u_2(x) = u_3(x)q_2(x) + b_6^5u_4(x)$; so we can
multiply row $B_3$ by $c_4^3$, subtract integer multiples of rows $C_5$, $C_4$, and $C_3$, then
divide by $b_6^5$ to obtain row $D_3$.
In order to prove that $u_4(x)$ has integer coefficients, let us consider the
matrix

A_2   a_8 a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0  0   0
A_1    0  a_8 a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0  0
A_0    0   0  a_8 a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0
B_4   b_6 b_5 b_4 b_3 b_2 b_1 b_0  0   0   0   0
B_3    0  b_6 b_5 b_4 b_3 b_2 b_1 b_0  0   0   0      = M.      (17)
B_2    0   0  b_6 b_5 b_4 b_3 b_2 b_1 b_0  0   0
B_1    0   0   0  b_6 b_5 b_4 b_3 b_2 b_1 b_0  0
B_0    0   0   0   0  b_6 b_5 b_4 b_3 b_2 b_1 b_0

The indicated row operations and a permutation of rows will transform M into

B_4   b_6 b_5 b_4 b_3 b_2 b_1 b_0  0   0   0   0
B_3    0  b_6 b_5 b_4 b_3 b_2 b_1 b_0  0   0   0
B_2    0   0  b_6 b_5 b_4 b_3 b_2 b_1 b_0  0   0
B_1    0   0   0  b_6 b_5 b_4 b_3 b_2 b_1 b_0  0
C_2    0   0   0   0  c_4 c_3 c_2 c_1 c_0  0   0      = M′.     (18)
C_1    0   0   0   0   0  c_4 c_3 c_2 c_1 c_0  0
C_0    0   0   0   0   0   0  c_4 c_3 c_2 c_1 c_0
D_0    0   0   0   0   0   0   0   0  d_2 d_1 d_0


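Because of the way M′ has been derived from M, we must have

$$b_6^3\cdot b_6^3\cdot b_6^3\cdot(c_4^3/b_6^5)\cdot\det M_0 = \pm\det M_0',$$

if $M_0$ and $M_0'$ represent any square matrices obtained by selecting eight corre-
sponding columns from M and M′. For example, let us select the first seven
columns and the column containing $d_1$; then

$$b_6^3\cdot b_6^3\cdot b_6^3\cdot(c_4^3/b_6^5)\cdot\det\begin{pmatrix}
a_8&a_7&a_6&a_5&a_4&a_3&a_2&0\\
0&a_8&a_7&a_6&a_5&a_4&a_3&a_0\\
0&0&a_8&a_7&a_6&a_5&a_4&a_1\\
b_6&b_5&b_4&b_3&b_2&b_1&b_0&0\\
0&b_6&b_5&b_4&b_3&b_2&b_1&0\\
0&0&b_6&b_5&b_4&b_3&b_2&0\\
0&0&0&b_6&b_5&b_4&b_3&b_0\\
0&0&0&0&b_6&b_5&b_4&b_1
\end{pmatrix} = \pm\, b_6^4\cdot c_4^3\cdot d_1.$$

Since $b_6c_4 \ne 0$, this proves that $d_1$ is an integer. Similarly, $d_2$ and $d_0$ are integers.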
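The determinant identity can be checked numerically for the running example. The sketch below is only a sanity check under stated assumptions: it builds the matrix M of (17) from the coefficients of the polynomials (9), picks the first seven columns plus one further column, and evaluates the determinant with exact rational elimination. The helper names are hypothetical.

```python
from fractions import Fraction

U = [1, 0, 1, 0, -3, -3, 8, 2, -5]     # a_8 ... a_0 from (9)
V = [3, 0, 5, 0, -4, -9, 21]           # b_6 ... b_0 from (9)


def shifted_row(coeffs, shift, width):
    """Coefficient row preceded by `shift` zeros and padded on the right."""
    return [0] * shift + list(coeffs) + [0] * (width - shift - len(coeffs))


def determinant(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    result = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result               # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


# Rows A_2, A_1, A_0 followed by B_4, ..., B_0, as in (17).
M = [shifted_row(U, s, 11) for s in range(3)] + \
    [shifted_row(V, s, 11) for s in range(5)]


def minor(extra_column):
    """det of the first seven columns of M together with one more column."""
    columns = list(range(7)) + [extra_column]
    return determinant([[row[c] for c in columns] for row in M])


# 0-based columns 8, 9, 10 are the ones holding d_2, d_1, d_0 in (18).
values = [abs(minor(c)) for c in (8, 9, 10)]
```

The three absolute values agree with the coefficients 65, 125, −245 of the third polynomial v(x) listed in (15), up to sign, as the identity predicts.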

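In general, we can show that $u_{j+1}(x)$ has integer coefficients in a similar
manner. If we start with the matrix M consisting of rows $A_{n_2-n_j}$ through $A_0$
and $B_{n_1-n_j}$ through $B_0$, and if we perform the row operations indicated in
Table 1, we will obtain a matrix M′ consisting in some order of rows $B_{n_1-n_j}$
through $B_{n_3-n_j+1}$, then $C_{n_2-n_j}$ through $C_{n_4-n_j+1}$, ..., $P_{n_{j-2}-n_j}$ through $P_1$,
then $Q_{n_{j-1}-n_j}$ through $Q_0$, and finally $R_0$ (a row containing the coefficients of
$u_{j+1}(x)$). Extracting appropriate columns shows that

$$\left(\frac{g_2^{\delta_1+1}}{g_1h_1^{\delta_1}}\right)^{n_2-n_j+1}
\left(\frac{g_3^{\delta_2+1}}{g_2h_2^{\delta_2}}\right)^{n_3-n_j+1}\cdots
\left(\frac{g_j^{\delta_{j-1}+1}}{g_{j-1}h_{j-1}^{\delta_{j-1}}}\right)^{n_j-n_j+1}
\det M_0
= \pm\, g_2^{n_1-n_3}\,g_3^{n_2-n_4}\cdots g_{j-1}^{n_{j-2}-n_j}\,g_j^{n_{j-1}-n_j+1}\,r_t, \tag{19}$$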
Table 1
COEFFICIENTS THAT ARISE IN ALGORITHM C

Row                                                            Multiply                   Replace
name                           Row                             by                         by row
A_5  a_8 a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0  0   0   0   0   0    b_6^3                      C_5
A_4   0  a_8 a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0  0   0   0   0    b_6^3                      C_4
A_3   0   0  a_8 a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0  0   0   0    b_6^3                      C_3
A_2   0   0   0  a_8 a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0  0   0    b_6^3                      C_2
A_1   0   0   0   0  a_8 a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0  0    b_6^3                      C_1
A_0   0   0   0   0   0  a_8 a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0   b_6^3                      C_0
B_7  b_6 b_5 b_4 b_3 b_2 b_1 b_0  0   0   0   0   0   0   0
B_6   0  b_6 b_5 b_4 b_3 b_2 b_1 b_0  0   0   0   0   0   0
B_5   0   0  b_6 b_5 b_4 b_3 b_2 b_1 b_0  0   0   0   0   0
B_4   0   0   0  b_6 b_5 b_4 b_3 b_2 b_1 b_0  0   0   0   0
B_3   0   0   0   0  b_6 b_5 b_4 b_3 b_2 b_1 b_0  0   0   0    c_4^3/b_6^5                D_3
B_2   0   0   0   0   0  b_6 b_5 b_4 b_3 b_2 b_1 b_0  0   0    c_4^3/b_6^5                D_2
B_1   0   0   0   0   0   0  b_6 b_5 b_4 b_3 b_2 b_1 b_0  0    c_4^3/b_6^5                D_1
B_0   0   0   0   0   0   0   0  b_6 b_5 b_4 b_3 b_2 b_1 b_0   c_4^3/b_6^5                D_0
C_5   0   0   0   0  c_4 c_3 c_2 c_1 c_0  0   0   0   0   0
C_4   0   0   0   0   0  c_4 c_3 c_2 c_1 c_0  0   0   0   0
C_3   0   0   0   0   0   0  c_4 c_3 c_2 c_1 c_0  0   0   0
C_2   0   0   0   0   0   0   0  c_4 c_3 c_2 c_1 c_0  0   0
C_1   0   0   0   0   0   0   0   0  c_4 c_3 c_2 c_1 c_0  0    d_2^3 b_6^4/c_4^5          E_1
C_0   0   0   0   0   0   0   0   0   0  c_4 c_3 c_2 c_1 c_0   d_2^3 b_6^4/c_4^5          E_0
D_3   0   0   0   0   0   0   0   0  d_2 d_1 d_0  0   0   0
D_2   0   0   0   0   0   0   0   0   0  d_2 d_1 d_0  0   0
D_1   0   0   0   0   0   0   0   0   0   0  d_2 d_1 d_0  0
D_0   0   0   0   0   0   0   0   0   0   0   0  d_2 d_1 d_0   e_1^2 c_4^2/(d_2^3 b_6^2)  F_0
E_1   0   0   0   0   0   0   0   0   0   0   0  e_1 e_0  0
E_0   0   0   0   0   0   0   0   0   0   0   0   0  e_1 e_0
F_0   0   0   0   0   0   0   0   0   0   0   0   0   0  f_0


where $r_t$ is a given coefficient of $u_{j+1}(x)$ and $M_0$ is a submatrix of M. The h’s
have been chosen very cleverly so that this equation simplifies to

$$\det M_0 = \pm\, r_t \tag{20}$$

(see exercise 24). Therefore every coefficient of $u_{j+1}(x)$ can be expressed as the
determinant of an $(n_1 + n_2 - 2n_j + 2) \times (n_1 + n_2 - 2n_j + 2)$ matrix whose elements
are coefficients of u(x) and v(x).

It remains to be shown that the cleverly chosen h’s also are integers. A
similar technique applies: Let’s look, for example, at the matrix

A_1   a_8 a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0  0
A_0    0  a_8 a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0
B_3   b_6 b_5 b_4 b_3 b_2 b_1 b_0  0   0   0
B_2    0  b_6 b_5 b_4 b_3 b_2 b_1 b_0  0   0      = M.      (21)
B_1    0   0  b_6 b_5 b_4 b_3 b_2 b_1 b_0  0
B_0    0   0   0  b_6 b_5 b_4 b_3 b_2 b_1 b_0
Row operations as specified in Table 1, and permutation of rows, lead to

B_3   b_6 b_5 b_4 b_3 b_2 b_1 b_0  0   0   0
B_2    0  b_6 b_5 b_4 b_3 b_2 b_1 b_0  0   0
B_1    0   0  b_6 b_5 b_4 b_3 b_2 b_1 b_0  0      = M′;     (22)
B_0    0   0   0  b_6 b_5 b_4 b_3 b_2 b_1 b_0
C_1    0   0   0   0  c_4 c_3 c_2 c_1 c_0  0
C_0    0   0   0   0   0  c_4 c_3 c_2 c_1 c_0

hence if we consider any submatrices $M_0$ and $M_0'$ obtained by selecting six
corresponding columns of M and M′ we have $b_6^3\cdot b_6^3\cdot\det M_0 = \pm\det M_0'$. When $M_0$
is chosen to be the first six columns of M, we find that $\det M_0 = \pm c_4^2/b_6^2 = \pm h_3$,
so $h_3$ is an integer.

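In general, to show that $h_j$ is an integer for j ≥ 3, we start with the matrix
M consisting of rows $A_{n_2-n_j-1}$ through $A_0$ and $B_{n_1-n_j-1}$ through $B_0$; then we
perform appropriate row operations until obtaining a matrix M′ consisting of
rows $B_{n_1-n_j-1}$ through $B_{n_3-n_j}$, then $C_{n_2-n_j-1}$ through $C_{n_4-n_j}$, ..., $P_{n_{j-2}-n_j-1}$
through $P_0$, then $Q_{n_{j-1}-n_j-1}$ through $Q_0$. Letting $M_0$ be the first $n_1 + n_2 - 2n_j$
columns of M, we obtain

$$\left(\frac{g_2^{\delta_1+1}}{g_1h_1^{\delta_1}}\right)^{n_2-n_j}
\left(\frac{g_3^{\delta_2+1}}{g_2h_2^{\delta_2}}\right)^{n_3-n_j}\cdots
\left(\frac{g_{j-1}^{\delta_{j-2}+1}}{g_{j-2}h_{j-2}^{\delta_{j-2}}}\right)^{n_{j-1}-n_j}
\det M_0
= \pm\, g_2^{n_1-n_3}\,g_3^{n_2-n_4}\cdots g_{j-1}^{n_{j-2}-n_j}\,g_j^{n_{j-1}-n_j}, \tag{23}$$

an equation that neatly simplifies to

$$\det M_0 = \pm\, h_j. \tag{24}$$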


(This proof, although stated for the domain of integers, obviously applies to any 
unique factorization domain.) 

In the process of verifying Algorithm C, we have also learned that every 
element of S dealt with by the algorithm can be expressed as a determinant whose 
entries are the coefficients of the primitive parts of the original polynomials. A 
well-known theorem of Hadamard (see exercise 15) states that 


$$|\det(a_{ij})| \le \prod_{1\le i\le n}\Bigl(\sum_{1\le j\le n} a_{ij}^2\Bigr)^{1/2}; \tag{25}$$

therefore every coefficient appearing in the polynomials computed by Algo-
rithm C is at most

$$N^{m+n}(m+1)^{n/2}(n+1)^{m/2}, \tag{26}$$


if all coefficients of the given polynomials u(x) and v(x) are bounded by N 
in absolute value. This same upper bound applies to the coefficients of all 
polynomials u(x) and v(x) computed during the execution of Algorithm E, since 
the polynomials obtained in Algorithm E are always divisors of the polynomials 
obtained in Algorithm C. 
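As a worked instance of (26), the bound can be evaluated for the running example, where m = 8, n = 6, and the coefficients of (9) are bounded by N = 21. The sketch below is illustrative only; the function name is hypothetical, and the bound is rounded up to an integer so that it can be compared exactly.

```python
from math import isqrt


def coefficient_bound(N, m, n):
    """Smallest integer >= N^(m+n) * (m+1)^(n/2) * (n+1)^(m/2), as in (26)."""
    squared = N**(2 * (m + n)) * (m + 1)**n * (n + 1)**m   # square of the bound
    root = isqrt(squared)
    return root if root * root == squared else root + 1


bound = coefficient_bound(21, 8, 6)

# All coefficients that appear in table (15), plus the final constant 260708.
seen = [1, 0, 1, 0, -3, -3, 8, 2, -5, 3, 0, 5, 0, -4, -9, 21,
        -15, 0, 3, 0, -9, 65, 125, -245, -9326, 12300, 260708]
within_bound = all(abs(c) <= bound for c in seen)
```

The actual coefficients are far below the bound in this example; (26) is a worst-case guarantee, not an estimate of typical size.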

This upper bound on the coefficients is extremely gratifying, because it is 
much better than we would ordinarily have a right to expect. For example, 
consider what happens if we avoid the corrections in steps E3 and C3, merely 
replacing v(x) by r(x). This is the simplest gcd algorithm, and it is the one
that traditionally appears in textbooks on algebra (for theoretical purposes, not
intended for practical calculations). If we suppose that $\delta_1 = \delta_2 = \cdots = 1$, we
find that the coefficients of $u_3(x)$ are bounded by $N^3$, the coefficients of $u_4(x)$
are bounded by $N^7$, those of $u_5(x)$ by $N^{17}$, ...; the coefficients of $u_k(x)$ are
bounded by $N^{a_k}$, where $a_k = 2a_{k-1} + a_{k-2}$. Thus the upper bound, in place of
(26) for m = n + 1, would be approximately

$$N^{0.5(2.414)^n}, \tag{27}$$
and experiments show that the simple algorithm does in fact have this behavior; 
the number of digits in the coefficients grows exponentially at each step! In 
Algorithm E, by contrast, the growth in the number of digits is only slightly 
more than linear at most. 
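The exponents quoted above follow directly from the recurrence $a_k = 2a_{k-1} + a_{k-2}$ with $a_1 = a_2 = 1$. A short sketch (the function name is hypothetical) makes the exponential growth visible:

```python
def naive_exponents(count):
    """First `count` exponents a_1, a_2, ... with a_k = 2*a_(k-1) + a_(k-2)."""
    a = [1, 1]
    while len(a) < count:
        a.append(2 * a[-1] + a[-2])
    return a[:count]


exponents = naive_exponents(8)
# Successive ratios approach 1 + sqrt(2) = 2.414..., the base that appears in (27).
ratios = [exponents[i + 1] / exponents[i] for i in range(1, len(exponents) - 1)]
```

The sequence begins 1, 1, 3, 7, 17, 41, so the number of digits in the coefficients more than doubles at every step of the uncorrected algorithm.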

Another byproduct of our proof of Algorithm C is the fact that the degrees of 
the polynomials will almost always decrease by 1 at each step, so that the number 
of iterations of step C2 (or E2) will usually be deg(v) if the given polynomials 
are “random.” In order to see why this happens, notice for example that we 
could have chosen the first eight columns of M and M′ in (17) and (18); then
we would have found that $u_4(x)$ has degree less than 3 if and only if $d_3 = 0$, that
is, if and only if

$$\det\begin{pmatrix}
a_8&a_7&a_6&a_5&a_4&a_3&a_2&a_1\\
0&a_8&a_7&a_6&a_5&a_4&a_3&a_2\\
0&0&a_8&a_7&a_6&a_5&a_4&a_3\\
b_6&b_5&b_4&b_3&b_2&b_1&b_0&0\\
0&b_6&b_5&b_4&b_3&b_2&b_1&b_0\\
0&0&b_6&b_5&b_4&b_3&b_2&b_1\\
0&0&0&b_6&b_5&b_4&b_3&b_2\\
0&0&0&0&b_6&b_5&b_4&b_3
\end{pmatrix} = 0.$$

In general, $\delta_j$ will be greater than 1 for j > 1 if and only if a similar determinant
in the coefficients of u(x) and v(x) is zero. Since such a determinant is a nonzero
multivariate polynomial in the coefficients, it will be nonzero “almost always,”
or “with probability 1.” (See exercise 16 for a more precise formulation of this
statement, and see exercise 4 for a related proof.) The example polynomials in
(15) have both $\delta_2$ and $\delta_3$ equal to 2, so they are exceptional indeed.

The considerations above can be used to derive the well-known fact that 
two polynomials are relatively prime if and only if their resultant is nonzero; 
the resultant is a determinant having the form of rows $A_5$ through $A_0$ and $B_7$
through $B_0$ in Table 1. (This is “Sylvester’s determinant”; see exercise 12.
Further properties of resultants are discussed in B. L. van der Waerden, Modern 
Algebra, translated by Fred Blum (New York: Ungar, 1949), Sections 27-28.) 
From the standpoint discussed above, we could say that the gcd is “almost 
always” of degree zero, since Sylvester’s determinant is almost never zero. But 
many calculations of practical interest would never be undertaken if there weren’t 
some reasonable chance that the gcd would be a polynomial of positive degree. 
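The resultant criterion can be illustrated with a small sketch that builds Sylvester's matrix from two coefficient lists and evaluates its determinant. The helper names are hypothetical, and the determinant routine is a fraction-free (Bareiss-style) elimination chosen here only because it keeps every intermediate value an integer; it is not taken from the text above.

```python
def sylvester_matrix(u, v):
    """Rows A_(n-1)..A_0 and B_(m-1)..B_0 for deg(u) = m, deg(v) = n."""
    m, n = len(u) - 1, len(v) - 1
    width = m + n
    rows = [[0] * s + list(u) + [0] * (width - s - len(u)) for s in range(n)]
    rows += [[0] * s + list(v) + [0] * (width - s - len(v)) for s in range(m)]
    return rows


def integer_determinant(matrix):
    """Determinant of an integer matrix by fraction-free elimination."""
    a = [row[:] for row in matrix]
    n = len(a)
    sign, previous = 1, 1
    for k in range(n - 1):
        if a[k][k] == 0:                       # find a nonzero pivot below
            swap = next((i for i in range(k + 1, n) if a[i][k] != 0), None)
            if swap is None:
                return 0
            a[k], a[swap] = a[swap], a[k]
            sign = -sign
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                # this division by the previous pivot is always exact
                a[i][j] = (a[i][j] * a[k][k] - a[i][k] * a[k][j]) // previous
        previous = a[k][k]
    return sign * a[n - 1][n - 1]


def resultant(u, v):
    return integer_determinant(sylvester_matrix(u, v))


# (x-1)(x-2) and (x-1)(x+3) share the factor x-1, so their resultant vanishes.
common = resultant([1, -3, 2], [1, 2, -3])
# The polynomials (9) are relatively prime, so their resultant is nonzero.
coprime = resultant([1, 0, 1, 0, -3, -3, 8, 2, -5], [3, 0, 5, 0, -4, -9, 21])
```

For the polynomials (9), `sylvester_matrix` produces the 14 × 14 array formed by the A and B rows of Table 1.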
We can see exactly what happens during Algorithms E and C when the gcd is
not 1 by considering $u(x) = w(x)u_1(x)$ and $v(x) = w(x)u_2(x)$, where $u_1(x)$ and
$u_2(x)$ are relatively prime and w(x) is primitive. Then if the polynomials $u_1(x)$,
$u_2(x)$, $u_3(x)$, ... are obtained when Algorithm E works on $u(x) = u_1(x)$ and
$v(x) = u_2(x)$, it is easy to see that the sequence obtained for $u(x) = w(x)u_1(x)$
and $v(x) = w(x)u_2(x)$ is simply $w(x)u_1(x)$, $w(x)u_2(x)$, $w(x)u_3(x)$, $w(x)u_4(x)$,
etc. With Algorithm C the behavior is different: If the polynomials $u_1(x)$,
$u_2(x)$, $u_3(x)$, ... are obtained when Algorithm C is applied to $u(x) = u_1(x)$ and
$v(x) = u_2(x)$, and if we assume that $\deg(u_{j+1}) = \deg(u_j) - 1$ (which is almost
always true when j > 1), then the sequence

$$w(x)u_1(x),\quad w(x)u_2(x),\quad \ell^2w(x)u_3(x),\quad \ell^4w(x)u_4(x),\quad \ell^6w(x)u_5(x),\quad \ldots \tag{28}$$

is obtained when Algorithm C is applied to $u(x) = w(x)u_1(x)$ and $v(x) =
w(x)u_2(x)$, where ℓ = ℓ(w). (See exercise 13.) Even though these additional
ℓ-factors are present, Algorithm C will be superior to Algorithm E, because it is
easier to deal with slightly larger polynomials than to calculate primitive parts
repeatedly.

Polynomial remainder sequences such as those in Algorithms C and E are 
not useful merely for finding greatest common divisors and resultants. Another 
important application is to the enumeration of real roots, for a given polynomial 
in a given interval, according to the famous theorem of J. Sturm [Mém. Présentés 
par Divers Savants 6 (Paris, 1835), 271-318]. Let u(x) be a polynomial over the 
real numbers, having distinct complex roots. We shall see in the next section 
that the roots are distinct if and only if gcd(u(x), u′(x)) = 1, where u′(x) is the
derivative of u(x); accordingly, there is a polynomial remainder sequence proving
that u(x) is relatively prime to u′(x). We set $u_0(x) = u(x)$, $u_1(x) = u'(x)$, and
(following Sturm) we negate the sign of all remainders, obtaining

$$\begin{aligned}
c_1u_0(x) &= u_1(x)q_1(x) - d_1u_2(x),\\
c_2u_1(x) &= u_2(x)q_2(x) - d_2u_3(x),\\
&\;\;\vdots\\
c_ku_{k-1}(x) &= u_k(x)q_k(x) - d_ku_{k+1}(x),
\end{aligned} \tag{29}$$

for some positive constants $c_j$ and $d_j$, where $\deg(u_{k+1}) = 0$. We say that the
variation V(u, a) of u(x) at a is the number of changes of sign in the sequence
$u_0(a)$, $u_1(a)$, ..., $u_{k+1}(a)$, not counting zeros. For example, if the sequence of
signs is 0, +, −, −, 0, +, +, −, we have V(u, a) = 3. Sturm’s theorem asserts
that the number of roots of u(x) in the interval a < x ≤ b is V(u, a) − V(u, b);
and the proof is surprisingly short (see exercise 22). 
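Sturm's root-counting rule can be sketched with exact rational arithmetic. The code below is an illustration under stated assumptions: it takes $c_j = d_j = 1$ in (29) by working over the rationals, it assumes that u(x) has distinct roots, and all function names are hypothetical.

```python
from fractions import Fraction


def remainder(u, v):
    """Remainder of u(x) divided by v(x) over the rationals (highest degree first)."""
    u = [Fraction(a) for a in u]
    while len(u) >= len(v):
        c = u[0] / v[0]
        for i, b in enumerate(v):
            u[i] -= c * b
        u.pop(0)                        # leading term has cancelled
    while u and u[0] == 0:
        u.pop(0)
    return u


def sturm_sequence(u):
    """u_0 = u, u_1 = u', then negated remainders as in (29) with c_j = d_j = 1."""
    degree = len(u) - 1
    derivative = [a * (degree - i) for i, a in enumerate(u[:-1])]
    seq = [list(u), derivative]
    while len(seq[-1]) > 1:
        seq.append([-a for a in remainder(seq[-2], seq[-1])])
    return seq


def evaluate(p, x):
    value = 0
    for a in p:                         # Horner's rule
        value = value * x + a
    return value


def sign_changes(values):
    """Number of sign changes in a sequence, not counting zeros."""
    signs = [v > 0 for v in values if v != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def count_roots(u, a, b):
    """Number of real roots of u(x) with a < x <= b, namely V(u,a) - V(u,b)."""
    seq = sturm_sequence(u)
    return (sign_changes([evaluate(p, a) for p in seq])
            - sign_changes([evaluate(p, b) for p in seq]))


# x^3 - x has the three real roots -1, 0, 1.
roots_in_wide_interval = count_roots([1, 0, -1, 0], -2, 2)
```

The helper `sign_changes` applied to the sign pattern 0, +, −, −, 0, +, +, − gives 3, matching the example of V(u, a) in the text.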

Although Algorithms C and E are interesting, they aren’t the whole story. 
Important alternative ways to calculate polynomial gcds over the integers are 
discussed at the end of Section 4.6.2. There is also a general determinant- 
evaluation algorithm that may be said to include Algorithm C as a special case; 
see E. H. Bareiss, Math. Comp. 22 (1968), 565-578. 
In the fourth edition of this book I plan to redo the exposition of the
present section, taking into proper account the 19th-century research on
determinants, as well as the work of W. Habicht, Comm. Math. Helvetici 21 
(1948), 99-116. An excellent discussion of the latter has been given by R. Loos 
in Computing, Supplement 4 (1982), 115-137. An interesting method for evalu- 
ating determinants, published in 1853 by Felice Chio and rediscovered by C. L. 
Dodgson (aka Lewis Carroll), is also highly relevant. See D. E. Knuth, Electronic 
J. Combinatorics 3, 2 (1996), paper #R5, §3, for a summary of the early history 
of identities between determinants of submatrices. 


EXERCISES 

1. [10] Compute the pseudo-quotient q(x) and pseudo-remainder r(x), namely the
polynomials satisfying (8), when $u(x) = x^6 + x^5 - x^4 + 2x^3 + 3x^2 - x + 2$ and $v(x) =
2x^3 + 2x^2 - x + 3$, over the integers.

2. [15] What is the greatest common divisor of $3x^6 + x^5 + 4x^4 + 4x^3 + 3x^2 + 4x + 2$
and its “reverse” $2x^6 + 4x^5 + 3x^4 + 4x^3 + 4x^2 + x + 3$, modulo 7?


3. [M25] Show that Euclid’s algorithm for polynomials over a field S can be extended
to find polynomials U(x) and V(x) over S such that

u(x)V(x) + U(x)v(x) = gcd(u(x), v(x)).

(See Algorithm 4.5.2X.) What are the degrees of the polynomials U(x) and V(x) that
are computed by this extended algorithm? Prove that if S is the field of rational
numbers, and if $u(x) = x^m - 1$ and $v(x) = x^n - 1$, then the extended algorithm yields
polynomials U(x) and V(x) having integer coefficients. Find U(x) and V(x) when
$u(x) = x^{21} - 1$ and $v(x) = x^{13} - 1$.

4. [M30] Let p be prime, and suppose that Euclid’s algorithm applied to the poly-
nomials u(x) and v(x) modulo p yields a sequence of polynomials having respective
degrees m, n, $n_1$, ..., $n_t$, −∞, where m = deg(u), n = deg(v), and $n_t \ge 0$. Assume
that m ≥ n. If u(x) and v(x) are monic polynomials, independently and uniformly
distributed over all the $p^{m+n}$ pairs of monic polynomials having respective degrees
m and n, what are the average values of the three quantities t, $n_1 + \cdots + n_t$, and
$(n - n_1)n_1 + \cdots + (n_{t-1} - n_t)n_t$, as functions of m, n, and p? (These three quantities
are the fundamental factors in the running time of Euclid’s algorithm applied to
polynomials modulo p, assuming that division is done by Algorithm D.) [Hint: Show
that u(x) mod v(x) is uniformly distributed and independent of v(x).]


5. [M22] What is the probability that u(x) and v(x) are relatively prime modulo p, 
if u(x) and v(x) are independently and uniformly distributed monic polynomials of 
degree n? 

6. [M23] We have seen that Euclid’s Algorithm 4.5.2A for integers can be directly 
adapted to an algorithm for the greatest common divisor of polynomials. Can the 
binary gcd algorithm, Algorithm 4.5.2B, be adapted in an analogous way to an algo- 
rithm that applies to polynomials? 

7. [M10] What are the units in the domain of all polynomials over a unique factor-
ization domain S?

8. [M22] Show that if a polynomial with integer coefficients is irreducible over the 
domain of integers, it is irreducible when considered as a polynomial over the field of 
rational numbers. 
9. [M25] Let u(x) and v(x) be primitive polynomials over a unique factorization
domain S. Prove that u(x) and v(x) are relatively prime if and only if there are
polynomials U(x) and V(x) over S such that u(x)V(x) + U(x)v(x) is a polynomial of
degree zero. [Hint: Extend Algorithm E, as Algorithm 4.5.2A is extended in exercise 3.]

10. [M28] Prove that the polynomials over a unique factorization domain form a 
unique factorization domain. [Hint: Use the result of exercise 9 to help show that 
there is at most one kind of factorization possible.] 


11. [M22] What row names would have appeared in Table 1 if the sequence of degrees
had been 9, 6, 5, 2, −∞ instead of 8, 6, 4, 2, 1, 0?

12. [M24] Let $u_1(x)$, $u_2(x)$, $u_3(x)$, ... be a sequence of polynomials obtained during a
run of Algorithm C. “Sylvester’s matrix” is the square matrix formed from rows $A_{n_2-1}$
through $A_0$ and $B_{n_1-1}$ through $B_0$ (in a notation analogous to that of Table 1). Show
that if $u_1(x)$ and $u_2(x)$ have a common factor of positive degree, then the determinant
of Sylvester’s matrix is zero; conversely, given that $\deg(u_k) = 0$ for some k, show that
the determinant of Sylvester’s matrix is nonzero by deriving a formula for its absolute
value in terms of $\ell(u_j)$ and $\deg(u_j)$, 1 ≤ j ≤ k.


13. [M22] Show that the leading coefficient ℓ of the primitive part of gcd(u(x), v(x))
enters into Algorithm C’s polynomial sequence as shown in (28), when $\delta_1 = \delta_2 = \cdots =
\delta_{k-1} = 1$. What is the behavior for general $\delta_j$?


14. [M29] Let r(x) be the pseudo-remainder when u(x) is pseudo-divided by v(x). If
deg(u) ≥ deg(v) + 2 and deg(v) ≥ deg(r) + 2, show that r(x) is a multiple of ℓ(v).


15. [M26] Prove Hadamard’s inequality (25). [Hint: Consider the matrix $AA^T$.]


16. [M22] Let $f(x_1, \ldots, x_n)$ be a multivariate polynomial that is not identically zero,
and let $r(S_1, \ldots, S_n)$ be the set of roots $(x_1, \ldots, x_n)$ of $f(x_1, \ldots, x_n) = 0$ such that
$x_1 \in S_1$, ..., $x_n \in S_n$. If the degree of f is at most $d_j \le |S_j|$ in the variable $x_j$, prove
that

$$|r(S_1, \ldots, S_n)| \le |S_1| \ldots |S_n| - (|S_1| - d_1) \ldots (|S_n| - d_n).$$

Therefore the probability of finding a root at random, $|r(S_1, \ldots, S_n)|/|S_1| \ldots |S_n|$,
approaches zero as the sets $S_j$ get bigger. [This inequality has many applications
in the design of randomized algorithms, because it provides a good way to test whether
a complicated sum of products of sums is identically zero without expanding out all
the terms.]


17. [M32] (P. M. Cohn’s algorithm for division of string polynomials.) Let A be an
alphabet, that is, a set of symbols. A string α on A is a sequence of n ≥ 0 symbols,
$\alpha = a_1 \ldots a_n$, where each $a_j$ is in A. The length of α, denoted by |α|, is the number n
of symbols. A string polynomial on A is a finite sum $U = \sum_k r_k\alpha_k$, where each $r_k$ is a
nonzero rational number and each $\alpha_k$ is a string on A; we assume that $\alpha_j \ne \alpha_k$ when
j ≠ k. The degree of U, deg(U), is defined to be −∞ if U = 0 (that is, if the sum is
empty), otherwise deg(U) = max $|\alpha_k|$. The sum and product of string polynomials are
defined in an obvious manner; thus, $(\sum_j r_j\alpha_j)(\sum_k s_k\beta_k) = \sum_{j,k} r_js_k\alpha_j\beta_k$, where the
product of two strings is obtained by simply juxtaposing them, after which we collect
like terms. For example, if A = {a, b}, U = ab + ba − 2a − 2b, and V = a + b − 1, then
deg(U) = 2, deg(V) = 1, $V^2$ = aa + ab + ba + bb − 2a − 2b + 1, and $V^2 - U$ = aa + bb + 1.
Clearly deg(UV) = deg(U) + deg(V), and deg(U + V) ≤ max(deg(U), deg(V)), with
equality in the latter formula if deg(U) ≠ deg(V). (String polynomials may be regarded
as ordinary multivariate polynomials over the field of rational numbers, except that the
variables are not commutative under multiplication. In the conventional language of
pure mathematics, the set of string polynomials with the operations defined here is the
“free associative algebra” generated by A over the rationals.)


a) Let $Q_1$, $Q_2$, U, and V be string polynomials with deg(U) > deg(V) and such that
deg($Q_1U - Q_2V$) < deg($Q_1U$). Give an algorithm to find a string polynomial Q
such that deg(U − QV) < deg(U). (Thus if we are given U and V such that
$Q_1U = Q_2V + R$ and deg(R) < deg($Q_1U$), for some $Q_1$ and $Q_2$, then there is a
solution to these conditions with $Q_1 = 1$.)

b) Given that U and V are string polynomials with deg(V) > deg($Q_1U - Q_2V$) for
some $Q_1$ and $Q_2$, show that the result of (a) can be improved to find a quotient Q
such that U = QV + R, deg(R) < deg(V). (This is the analog of (1) for string
polynomials; part (a) showed that we can make deg(R) < deg(U), under weaker
hypotheses.)

c) A homogeneous polynomial is one whose terms all have the same degree (length).
If $U_1$, $U_2$, $V_1$, $V_2$ are homogeneous string polynomials with $U_1V_1 = U_2V_2$ and
deg($V_1$) > deg($V_2$), show that there is a homogeneous string polynomial U such
that $U_2 = U_1U$ and $V_1 = UV_2$.

d) Given that U and V are homogeneous string polynomials with UV = VU, prove
that there is a homogeneous string polynomial W such that $U = rW^m$, $V = sW^n$
for some integers m, n and rational numbers r, s. Give an algorithm to compute
such a W having the largest possible degree. (This algorithm is of interest, for
example, when U = α and V = β are strings satisfying αβ = βα; then W is
simply a string γ. When $U = x^m$ and $V = x^n$, the solution of largest degree is the
string $W = x^{\gcd(m,n)}$, so this algorithm includes a gcd algorithm for integers as a
special case.)


▶ 18. [M24] (Euclidean algorithm for string polynomials.) Let $V_1$ and $V_2$ be string
polynomials, not both zero, having a common left multiple. (This means that there exist
string polynomials $U_1$ and $U_2$, not both zero, such that $U_1V_1 = U_2V_2$.) The purpose
of this exercise is to find an algorithm to compute their greatest common right divisor
gcrd($V_1$, $V_2$) and their least common left multiple lclm($V_1$, $V_2$). The latter quantities
are defined as follows: gcrd($V_1$, $V_2$) is a common right divisor of $V_1$ and $V_2$ (that is,
$V_1 = W_1\,$gcrd($V_1$, $V_2$) and $V_2 = W_2\,$gcrd($V_1$, $V_2$) for some $W_1$ and $W_2$), and any common
right divisor of $V_1$ and $V_2$ is a right divisor of gcrd($V_1$, $V_2$); lclm($V_1$, $V_2$) = $Z_1V_1 = Z_2V_2$
for some $Z_1$ and $Z_2$, and any common left multiple of $V_1$ and $V_2$ is a left multiple of
lclm($V_1$, $V_2$).

For example, let $U_1$ = abbbab + abbab − bbab + ab − 1, $V_1$ = babab + abab + ab − b;
$U_2$ = abb + ab − b, $V_2$ = babbabab + bababab + babab + abab − babb − 1. Then we
have $U_1V_1 = U_2V_2$ = abbbabbabab + abbabbabab + abbbababab + abbababab − bbabbabab +
abbbabab − bbababab + 2abbabab − abbbabb + ababab − abbabb − bbabab − babab + bbabb −
abb − ab + b. For these string polynomials it can be shown that gcrd($V_1$, $V_2$) = ab + 1,
and lclm($V_1$, $V_2$) = $U_1V_1$.

The division algorithm of exercise 17 may be restated thus: If $V_1$ and $V_2$ are string
polynomials, with $V_2 \ne 0$, and if $U_1 \ne 0$ and $U_2$ satisfy the equation $U_1V_1 = U_2V_2$,
then there exist string polynomials Q and R such that

$V_1 = QV_2 + R$, where deg(R) < deg($V_2$).

It follows readily that Q and R are uniquely determined; they do not depend on the
given $U_1$ and $U_2$. Furthermore the result is right-left symmetric, in the sense that

$U_2 = U_1Q + R'$, where deg(R′) = deg($U_1$) − deg($V_2$) + deg(R) < deg($U_1$).
Show that this division algorithm can be extended to an algorithm that computes
lclm($V_1$, $V_2$) and gcrd($V_1$, $V_2$); in fact, the extended algorithm finds string polynomials
$Z_1$ and $Z_2$ such that $Z_1V_1 + Z_2V_2$ = gcrd($V_1$, $V_2$). [Hint: Use auxiliary variables $u_1$, $u_2$,
$v_1$, $v_2$, $w_1$, $w_2$, $w_1'$, $w_2'$, $z_1$, $z_2$, $z_1'$, $z_2'$, whose values are string polynomials; start by
setting $u_1 \leftarrow U_1$, $u_2 \leftarrow U_2$, $v_1 \leftarrow V_1$, $v_2 \leftarrow V_2$, and throughout the algorithm maintain
the conditions

$$\begin{aligned}
U_1w_1 + U_2w_2 &= u_1, & z_1V_1 + z_2V_2 &= v_1,\\
U_1w_1' + U_2w_2' &= u_2, & z_1'V_1 + z_2'V_2 &= v_2,\\
u_1z_1 - u_2z_1' &= (-1)^nU_1, & w_1v_1 - w_1'v_2 &= (-1)^nV_1,\\
-u_1z_2 + u_2z_2' &= (-1)^nU_2, & -w_2v_1 + w_2'v_2 &= (-1)^nV_2
\end{aligned}$$

at the nth iteration. This might be regarded as the “ultimate” extension of Euclid’s
algorithm.]


19. [M39] (Common divisors of square matrices.) Exercise 18 shows that the concept 
of greatest common right divisor can be meaningful when multiplication is not commu- 
tative. Prove that any two n x n matrices A and B of integers have a greatest common 
right matrix divisor D. [Suggestion: Design an algorithm whose inputs are A and B,
and whose outputs are integer matrices D, P, Q, X, Y, where A = PD, B = QD, and
D = XA + YB.] Find a greatest common right divisor of the matrices (; J and ie $),


20. [M40] Investigate approximate polynomial gcds and the accuracy of Euclid’s al- 
gorithm: What can be said about calculation of the greatest common divisor of poly- 
nomials whose coefficients are floating point numbers? 


21. [M25] Prove that the computation time required by Algorithm C to compute the 
gcd of two nth degree polynomials over the integers is $O(n^4(\log Nn)^2)$, if the coefficients
of the given polynomials are bounded by N in absolute value. 


22. [M23] Prove Sturm’s theorem. [Hint: Some sign sequences are impossible.] 


23. [M22] Prove that if u(x) in (29) has deg(u) real roots, then we have $\deg(u_{j+1}) =
\deg(u_j) - 1$ for 0 ≤ j ≤ k.


24. [M21] Show that (19) simplifies to (20) and (23) simplifies to (24). 


25. [M24] (W. S. Brown.) Prove that all the polynomials $u_j(x)$ in (16) for $j \ge 3$ are
multiples of $\gcd(\ell(u), \ell(v))$, and explain how to improve Algorithm C accordingly.


26. [M26] The purpose of this exercise is to give an analog for polynomials of the fact 
that continued fractions with positive integer entries give the best approximations to 
real numbers (exercise 4.5.3-42). 

Let $u(x)$ and $v(x)$ be polynomials over a field, with $\deg(u) > \deg(v)$, and let
$a_1(x)$, $a_2(x)$, ... be the quotient polynomials when Euclid's algorithm is applied to
$u(x)$ and $v(x)$. For example, the sequence of quotients in (5) and (6) is $9x^2 + 7$, $5x^2 + 5$,
$6x^3 + 5x^2 + 6x + 5$, $9x + 12$. We wish to show that the convergents $p_n(x)/q_n(x)$ of the
continued fraction $/\!/a_1(x), a_2(x), \ldots/\!/$ are the "best approximations" of low degree to
the rational function $v(x)/u(x)$, where we have $p_n(x) = K_{n-1}(a_2(x), \ldots, a_n(x))$ and
$q_n(x) = K_n(a_1(x), \ldots, a_n(x))$ in terms of the continuant polynomials of Eq. 4.5.3-(4).
By convention, we let $p_0(x) = q_{-1}(x) = 0$, $p_{-1}(x) = q_0(x) = 1$.

Prove that if $p(x)$ and $q(x)$ are polynomials such that $\deg(q) < \deg(q_n)$ and
$\deg(pu - qv) \le \deg(p_{n-1}u - q_{n-1}v)$, for some $n \ge 1$, then $p(x) = cp_{n-1}(x)$ and
$q(x) = cq_{n-1}(x)$ for some constant $c$. In particular, each $q_n(x)$ is a "record-breaking"
polynomial in the sense that no nonzero polynomial $q(x)$ of smaller degree can make
the quantity $p(x)u(x) - q(x)v(x)$, for any polynomial $p(x)$, achieve a degree as small
as $p_n(x)u(x) - q_n(x)v(x)$.

27. [M23] Suggest a way to speed up the division of u(x) by v(x) when we know in 
advance that the remainder will be zero. 


*4.6.2. Factorization of Polynomials 


Let us now consider the problem of factoring polynomials, not merely finding 
the greatest common divisor of two or more of them. 


Factoring modulo p. As in the case of integer numbers (Sections 4.5.2, 4.5.4), 
the problem of factoring seems to be more difficult than finding the greatest 
common divisor. But factorization of polynomials modulo a prime integer p is 
not as hard to do as we might expect. It is much easier to find the factors of an 
arbitrary polynomial of degree n, modulo 2, than to use any known method to 
find the factors of an arbitrary n-bit binary number. This surprising situation 
is a consequence of an instructive factorization algorithm discovered in 1967 by 
Elwyn R. Berlekamp [Bell System Technical J. 46 (1967), 1853-1859]. 

Let $p$ be a prime number; all arithmetic on polynomials in the following
discussion will be done modulo $p$. Suppose that someone has given us a polynomial
$u(x)$, whose coefficients are chosen from the set $\{0, 1, \ldots, p-1\}$; we may
assume that $u(x)$ is monic. Our goal is to express $u(x)$ in the form

$$u(x) = p_1(x)^{e_1} \ldots p_r(x)^{e_r}, \tag{1}$$

where $p_1(x)$, ..., $p_r(x)$ are distinct, monic, irreducible polynomials.

As a first step, we can use a standard technique to determine whether any
of the exponents $e_1$, ..., $e_r$ are greater than unity. If

$$u(x) = u_nx^n + \cdots + u_0 = v(x)^2 w(x), \tag{2}$$

then the derivative (formed in the usual way, but modulo $p$) is

$$u'(x) = nu_nx^{n-1} + \cdots + u_1 = 2v(x)v'(x)w(x) + v(x)^2w'(x), \tag{3}$$

and this is a multiple of the squared factor $v(x)$. Therefore our first step in
factoring $u(x)$ is to form

$$\gcd(u(x), u'(x)) = d(x). \tag{4}$$

If $d(x)$ is equal to 1, we know that $u(x)$ is squarefree, the product of distinct
primes $p_1(x) \ldots p_r(x)$. If $d(x)$ is not equal to 1 and $d(x) \ne u(x)$, then $d(x)$ is a
proper factor of $u(x)$; the relation between the factors of $d(x)$ and the factors of
$u(x)/d(x)$ speeds up the factorization process nicely in this case (see exercises 34
and 36). Finally, if $d(x) = u(x)$, we must have $u'(x) = 0$; hence the coefficient $u_k$
of $x^k$ is nonzero only when $k$ is a multiple of $p$. This means that $u(x)$ can be
written as a polynomial of the form $v(x^p)$, and in such a case we have

$$u(x) = v(x^p) = (v(x))^p; \tag{5}$$

the factorization process can be completed by finding the irreducible factors
of $v(x)$ and raising them to the $p$th power.
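
The squarefree test of (2)-(4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the text: a polynomial is assumed to be a list of coefficients modulo a prime p with the constant term first, and the helper names `trim`, `poly_divmod`, `poly_gcd`, and `derivative` are hypothetical.

```python
def trim(a):
    """Remove high-order zero coefficients in place; [] represents 0."""
    while a and a[-1] == 0:
        a.pop()
    return a

def poly_divmod(a, b, p):
    """Quotient and remainder of a(x) by nonzero b(x), coefficients mod prime p."""
    a = trim([c % p for c in a])
    inv = pow(b[-1], p - 2, p)              # reciprocal of the leading coefficient
    q = [0] * max(len(a) - len(b) + 1, 0)
    while len(a) >= len(b):
        shift, c = len(a) - len(b), a[-1] * inv % p
        q[shift] = c
        for i, bc in enumerate(b):
            a[shift + i] = (a[shift + i] - c * bc) % p
        trim(a)
    return q, a

def poly_gcd(a, b, p):
    """Monic greatest common divisor modulo p, by Euclid's algorithm."""
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    while b:
        a, b = b, poly_divmod(a, b, p)[1]
    inv = pow(a[-1], p - 2, p) if a else 0
    return [c * inv % p for c in a]

def derivative(u, p):
    """Formal derivative u'(x) modulo p."""
    return trim([k * c % p for k, c in enumerate(u)][1:])

# (x + 1)^2 (x + 2) = x^3 + 4x^2 + 2 modulo 5 has the repeated factor x + 1.
u = [2, 0, 4, 1]
d = poly_gcd(u, derivative(u, 5), 5)        # d(x) of Eq. (4)
```

When `d` comes out as `[1]` the input is squarefree; when the derivative vanishes (as for $x^3 + 2$ modulo 3) the gcd is $u(x)$ itself, which is the case handled by (5).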
Identity (5) may appear somewhat strange to the reader; it is an important
fact that is basic to Berlekamp's algorithm and to several other methods we
shall discuss. We can prove it as follows: If $v_1(x)$ and $v_2(x)$ are any polynomials
modulo $p$, then

$$\begin{aligned}
(v_1(x) + v_2(x))^p &= v_1(x)^p + \binom{p}{1} v_1(x)^{p-1}v_2(x) + \cdots + \binom{p}{p-1} v_1(x)v_2(x)^{p-1} + v_2(x)^p\\
&= v_1(x)^p + v_2(x)^p,
\end{aligned}$$

since the binomial coefficients $\binom{p}{1}$, ..., $\binom{p}{p-1}$ are all multiples of $p$. Furthermore
if $a$ is any integer, we have $a^p \equiv a$ (modulo $p$) by Fermat's theorem. Therefore
when $v(x) = v_mx^m + v_{m-1}x^{m-1} + \cdots + v_0$, we find that

$$\begin{aligned}
v(x)^p &= (v_mx^m)^p + (v_{m-1}x^{m-1})^p + \cdots + (v_0)^p\\
&= v_mx^{mp} + v_{m-1}x^{(m-1)p} + \cdots + v_0 = v(x^p).
\end{aligned}$$

The remarks above show that the problem of factoring a polynomial reduces 
to the problem of factoring a squarefree polynomial. Let us therefore assume that 


$$u(x) = p_1(x)\,p_2(x) \ldots p_r(x) \tag{6}$$

is the product of distinct primes. How can we be clever enough to discover the
$p_j(x)$'s when only $u(x)$ is given? Berlekamp's idea is to make use of the Chinese
remainder theorem, which is valid for polynomials just as it is valid for integers
(see exercise 3). If $(s_1, s_2, \ldots, s_r)$ is any $r$-tuple of integers mod $p$, the Chinese
remainder theorem implies that there is a unique polynomial $v(x)$ such that

$$\begin{gathered}
v(x) \equiv s_1 \ (\text{modulo } p_1(x)), \quad \ldots, \quad v(x) \equiv s_r \ (\text{modulo } p_r(x)),\\
\deg(v) < \deg(p_1) + \deg(p_2) + \cdots + \deg(p_r) = \deg(u).
\end{gathered} \tag{7}$$

The notation "$g(x) \equiv h(x)$ (modulo $f(x)$)" that appears here has the same
meaning as "$g(x) \equiv h(x)$ (modulo $f(x)$ and $p$)" in exercise 3.2.2-11, since we
are considering polynomial arithmetic modulo $p$. The polynomial $v(x)$ in (7)
gives us a way to get at the factors of $u(x)$, for if $r \ge 2$ and $s_1 \ne s_2$, we will have
$\gcd(u(x), v(x) - s_1)$ divisible by $p_1(x)$ but not by $p_2(x)$.

Since this observation shows that we can get information about the factors
of $u(x)$ from appropriate solutions $v(x)$ of (7), let us analyze (7) more closely.
In the first place we can observe that the polynomial $v(x)$ satisfies the condition
$v(x)^p \equiv s_j^p = s_j \equiv v(x)$ (modulo $p_j(x)$) for $1 \le j \le r$; therefore

$$v(x)^p \equiv v(x) \ (\text{modulo } u(x)), \qquad \deg(v) < \deg(u). \tag{8}$$

In the second place we have the basic polynomial identity

$$x^p - x \equiv (x - 0)(x - 1) \ldots (x - (p-1)) \ (\text{modulo } p) \tag{9}$$

(see exercise 6); hence

$$v(x)^p - v(x) = (v(x) - 0)(v(x) - 1) \ldots (v(x) - (p-1)) \tag{10}$$

is an identity for any polynomial $v(x)$, when we are working modulo $p$. If $v(x)$
satisfies (8), it follows that $u(x)$ divides the left-hand side of (10), so every
irreducible factor of $u(x)$ must divide one of the $p$ relatively prime factors of the
right-hand side of (10). In other words, all solutions of (8) must have the form
of (7), for some $s_1$, $s_2$, ..., $s_r$; there are exactly $p^r$ solutions of (8).
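
The count of $p^r$ solutions can be confirmed by exhaustive search when $p$ and $\deg(u)$ are tiny. The sketch below is an illustration only, with hypothetical names `poly_mulmod` and `solutions_of_8`; it assumes that $u(x)$ is monic of degree at least 2 and is stored as a coefficient list with the constant term first. It tries every $v(x)$ of degree less than $\deg(u)$ and keeps those with $v(x)^p \equiv v(x)$.

```python
from itertools import product

def poly_mulmod(a, b, u, p):
    """a(x) b(x) mod monic u(x), coefficients mod p; result has deg(u) entries."""
    n = len(u) - 1
    r = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            r[i + j] = (r[i + j] + x * y) % p
    for k in range(len(r) - 1, n - 1, -1):      # eliminate x^k for k >= n
        c = r[k]
        for i in range(n + 1):
            r[k - n + i] = (r[k - n + i] - c * u[i]) % p
    return (r + [0] * n)[:n]

def solutions_of_8(u, p):
    """All coefficient vectors v with v(x)^p = v(x) modulo u(x) and p."""
    n = len(u) - 1
    found = []
    for v in product(range(p), repeat=n):
        w = [1] + [0] * (n - 1)
        for _ in range(p):                      # w = v^p mod u
            w = poly_mulmod(w, list(v), u, p)
        if w == list(v):
            found.append(v)
    return found

# u(x) = (x^2 + 1)(x + 1) = x^3 + x^2 + x + 1 modulo 3 has r = 2 prime factors.
count = len(solutions_of_8([1, 1, 1, 1], 3))
```

For this $u(x)$ the search finds $3^2 = 9$ of the 27 candidates, in agreement with the statement above.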

The solutions $v(x)$ to congruence (8) therefore provide a key to the factorization
of $u(x)$. It may seem harder to find all solutions to (8) than to factor
$u(x)$ in the first place, but in fact this is not true, since the set of solutions to
(8) is closed under addition. Let $\deg(u) = n$; we can construct the $n \times n$ matrix

$$Q = \begin{pmatrix}
q_{0,0} & q_{0,1} & \ldots & q_{0,n-1}\\
\vdots & \vdots & & \vdots\\
q_{n-1,0} & q_{n-1,1} & \ldots & q_{n-1,n-1}
\end{pmatrix} \tag{11}$$

where

$$x^{pk} \equiv q_{k,n-1}x^{n-1} + \cdots + q_{k,1}x + q_{k,0} \ (\text{modulo } u(x)). \tag{12}$$

Then $v(x) = v_{n-1}x^{n-1} + \cdots + v_1x + v_0$ is a solution to (8) if and only if

$$(v_0, v_1, \ldots, v_{n-1})\,Q = (v_0, v_1, \ldots, v_{n-1}); \tag{13}$$

for the latter equation holds if and only if

$$v(x) = \sum_j v_jx^j = \sum_j \sum_k v_kq_{k,j}x^j \equiv \sum_k v_kx^{pk} = v(x^p) \equiv v(x)^p \ (\text{modulo } u(x)).$$

Berlekamp's factoring algorithm therefore proceeds as follows:
B1. [Remove duplicate factors.] Ensure that $u(x)$ is squarefree; in other words, if
$\gcd(u(x), u'(x)) \ne 1$, reduce the problem of factoring $u(x)$, as stated earlier
in this section.

B2. [Get Q.] Form the matrix $Q$ defined by (11) and (12). This can be done in
different ways, depending on the size of $p$, as explained below.

B3. [Find null space.] "Triangularize" the matrix $Q - I$, where $I = (\delta_{ij})$ is the
$n \times n$ identity matrix, finding its rank $n - r$ and finding linearly independent
vectors $v^{[1]}$, ..., $v^{[r]}$ such that $v^{[j]}(Q - I) = (0, 0, \ldots, 0)$ for $1 \le j \le r$. (The
first vector $v^{[1]}$ may always be taken as $(1, 0, \ldots, 0)$, representing the trivial
solution $v^{[1]}(x) = 1$ to (8). The computation can be done using appropriate
column operations, as explained in Algorithm N below.) At this point, $r$ is
the number of irreducible factors of $u(x)$, because the solutions to (8) are
the $p^r$ polynomials corresponding to the vectors $t_1v^{[1]} + \cdots + t_rv^{[r]}$ for all
choices of integers $0 \le t_1, \ldots, t_r < p$. Therefore if $r = 1$ we know that $u(x)$
is irreducible, and the procedure terminates.

B4. [Split.] Calculate $\gcd(u(x), v^{[2]}(x) - s)$ for $0 \le s < p$, where $v^{[2]}(x)$ is
the polynomial represented by vector $v^{[2]}$. The result will be a nontrivial
factorization of $u(x)$, because $v^{[2]}(x) - s$ is nonzero and has degree less than
$\deg(u)$, and by exercise 7 we have

$$u(x) = \prod_{0 \le s < p} \gcd(v(x) - s, u(x)) \tag{14}$$

whenever $v(x)$ satisfies (8).

If the use of $v^{[2]}(x)$ does not succeed in splitting $u(x)$ into $r$ factors,
further factors can be obtained by calculating $\gcd(v^{[k]}(x) - s, w(x))$ for
$0 \le s < p$ and all factors $w(x)$ found so far, for $k = 3$, 4, ..., until $r$ factors
are obtained. (If we choose $s_i \ne s_j$ in (7), we obtain a solution $v(x)$ to (8)
that distinguishes $p_i(x)$ from $p_j(x)$; some $v^{[k]}(x) - s$ will be divisible by
$p_i(x)$ and not by $p_j(x)$, so this procedure will eventually find all of the
factors.)

If $p$ is 2 or 3, the calculations of this step are quite efficient; but if $p$ is
more than 25, say, there is a much better way to proceed, as we shall see
later. ▮


Historical notes: M. C. R. Butler [Quart. J. Math. 5 (1954), 102-107] observed 
that the matrix Q— I corresponding to a squarefree polynomial with r irreducible 
factors will have rank n — r, modulo p. Indeed, this fact was implicit in a more 
general result of K. Petr [Časopis pro Pěstování Matematiky a Fysiky 66 (1937), 
85-94], who determined the characteristic polynomial of Q. See also Š. Schwarz, 
Quart. J. Math. 7 (1956), 110-124. 

As an example of Algorithm B, let us now determine the factorization of

$$u(x) = x^8 + x^6 + 10x^4 + 10x^3 + 8x^2 + 2x + 8 \tag{15}$$

modulo 13. (This polynomial appears in several of the examples in Section 4.6.1.)
A quick calculation using Algorithm 4.6.1E shows that $\gcd(u(x), u'(x)) = 1$;
therefore $u(x)$ is squarefree, and we turn to step B2. Step B2 involves calculating
the $Q$ matrix, which in this case is an $8 \times 8$ array. The first row of $Q$ is always
$(1, 0, 0, \ldots, 0)$, representing the polynomial $x^0 \bmod u(x) = 1$. The second row
represents $x^{13} \bmod u(x)$, and, in general, $x^k \bmod u(x)$ may readily be determined
as follows (for relatively small values of $k$): If

$$u(x) = x^n + u_{n-1}x^{n-1} + \cdots + u_1x + u_0$$

and if

$$x^k \equiv a_{k,n-1}x^{n-1} + \cdots + a_{k,1}x + a_{k,0} \ (\text{modulo } u(x)),$$

then

$$\begin{aligned}
x^{k+1} &\equiv a_{k,n-1}x^n + \cdots + a_{k,1}x^2 + a_{k,0}x\\
&\equiv a_{k,n-1}(-u_{n-1}x^{n-1} - \cdots - u_1x - u_0) + a_{k,n-2}x^{n-1} + \cdots + a_{k,0}x\\
&= a_{k+1,n-1}x^{n-1} + \cdots + a_{k+1,1}x + a_{k+1,0},
\end{aligned}$$

where

$$a_{k+1,j} = a_{k,j-1} - a_{k,n-1}u_j. \tag{16}$$

In this formula $a_{k,-1}$ is treated as zero, so that $a_{k+1,0} = -a_{k,n-1}u_0$. The simple
"shift register" recurrence (16) makes it easy to calculate $x^k \bmod u(x)$ for $k = 1$,
2, 3, ..., $(n-1)p$. Inside a computer, this calculation is of course generally done
by maintaining a one-dimensional array $(a_{n-1}, \ldots, a_1, a_0)$ and repeatedly setting

$$t \leftarrow a_{n-1}, \quad a_{n-1} \leftarrow (a_{n-2} - tu_{n-1}) \bmod p, \quad \ldots, \quad a_1 \leftarrow (a_0 - tu_1) \bmod p,$$

and $a_0 \leftarrow (-tu_0) \bmod p$. (We have seen similar procedures in connection with
random number generation, 3.2.2-(10).) For the example polynomial $u(x)$
in (15), we obtain the following sequence of coefficients of $x^k \bmod u(x)$, using
arithmetic modulo 13:


  k   a_{k,7}  a_{k,6}  a_{k,5}  a_{k,4}  a_{k,3}  a_{k,2}  a_{k,1}  a_{k,0}
  0      0        0        0        0        0        0        0        1
  1      0        0        0        0        0        0        1        0
  2      0        0        0        0        0        1        0        0
  3      0        0        0        0        1        0        0        0
  4      0        0        0        1        0        0        0        0
  5      0        0        1        0        0        0        0        0
  6      0        1        0        0        0        0        0        0
  7      1        0        0        0        0        0        0        0
  8      0       12        0        3        3        5       11        5
  9     12        0        3        3        5       11        5        0
 10      0        4        3        2        8        0        2        8
 11      4        3        2        8        0        2        8        0
 12      3       11        8       12        1        2        5        7
 13     11        5       12       10       11        7        1        2
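Therefore the second row of $Q$ is $(2, 1, 7, 11, 10, 12, 5, 11)$. Similarly we may
determine $x^{26} \bmod u(x)$, ..., $x^{91} \bmod u(x)$, and we find that

$$Q = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
2 & 1 & 7 & 11 & 10 & 12 & 5 & 11\\
3 & 6 & 4 & 3 & 0 & 4 & 7 & 2\\
4 & 3 & 6 & 5 & 1 & 6 & 2 & 3\\
2 & 11 & 8 & 8 & 3 & 1 & 3 & 11\\
6 & 11 & 8 & 6 & 2 & 7 & 10 & 9\\
5 & 11 & 7 & 10 & 0 & 11 & 7 & 12\\
3 & 3 & 12 & 5 & 0 & 11 & 9 & 12
\end{pmatrix},
\qquad
Q - I = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
2 & 0 & 7 & 11 & 10 & 12 & 5 & 11\\
3 & 6 & 3 & 3 & 0 & 4 & 7 & 2\\
4 & 3 & 6 & 4 & 1 & 6 & 2 & 3\\
2 & 11 & 8 & 8 & 2 & 1 & 3 & 11\\
6 & 11 & 8 & 6 & 2 & 6 & 10 & 9\\
5 & 11 & 7 & 10 & 0 & 11 & 6 & 12\\
3 & 3 & 12 & 5 & 0 & 11 & 9 & 11
\end{pmatrix}. \tag{17}$$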
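
The shift-register recurrence (16) translates almost directly into code. The sketch below is illustrative only: the names `powers_of_x` and `q_matrix_small_p` are hypothetical, and coefficient lists are assumed to be stored with the constant term first (the reverse of the table's column order). It regenerates the table rows and collects every $p$th row to form $Q$.

```python
def powers_of_x(u, p, kmax):
    """Yield (k, coefficients of x^k mod u(x)) for k = 0..kmax, using (16).

    u is monic, given as [u_0, u_1, ..., u_{n-1}, 1].
    """
    n = len(u) - 1
    a = [1] + [0] * (n - 1)                       # x^0 = 1
    for k in range(kmax + 1):
        yield k, a[:]
        t = a[n - 1]                              # t <- a_{n-1}
        a = [(-t * u[0]) % p] + [(a[j - 1] - t * u[j]) % p for j in range(1, n)]

def q_matrix_small_p(u, p):
    """Rows x^0, x^p, ..., x^{(n-1)p} mod u(x): the matrix Q of (11), (12)."""
    n = len(u) - 1
    return [a for k, a in powers_of_x(u, p, (n - 1) * p) if k % p == 0]

U = [8, 2, 8, 10, 10, 0, 1, 0, 1]                 # polynomial (15), low order first
ROWS = dict(powers_of_x(U, 13, 13))
Q = q_matrix_small_p(U, 13)
```

With this ordering `Q[1]` is already in the left-to-right order used for the rows of (17), since row $k$ of $Q$ lists $q_{k,0}$ first.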


That finishes step B2; the next step of Berlekamp's procedure requires
finding the "null space" of $Q - I$. In general, suppose that $A$ is an $n \times n$
matrix over a field, whose rank $n - r$ is to be determined; suppose further that
we wish to determine linearly independent vectors $v^{[1]}$, $v^{[2]}$, ..., $v^{[r]}$ such that
$v^{[1]}A = v^{[2]}A = \cdots = v^{[r]}A = (0, \ldots, 0)$. An algorithm for this calculation
can be based on the observation that any column of $A$ may be multiplied by
a nonzero quantity, and any multiple of one of its columns may be added to a
different column, without changing the rank or the vectors $v^{[1]}$, ..., $v^{[r]}$. (These
transformations amount to replacing $A$ by $AB$, where $B$ is a nonsingular matrix.)
The following well-known "triangularization" procedure may therefore be used.


Algorithm N (Null space algorithm). Let $A$ be an $n \times n$ matrix, whose
elements $a_{ij}$ belong to a field and have subscripts in the range $0 \le i, j < n$.
This algorithm outputs $r$ vectors $v^{[1]}$, ..., $v^{[r]}$, which are linearly independent
over the field and satisfy $v^{[j]}A = (0, \ldots, 0)$, where $n - r$ is the rank of $A$.

N1. [Initialize.] Set $c_0 \leftarrow c_1 \leftarrow \cdots \leftarrow c_{n-1} \leftarrow -1$, $r \leftarrow 0$. (During the
calculation we will have $c_j \ge 0$ only if $a_{c_jj} = -1$ and all other entries
of row $c_j$ are zero.)

N2. [Loop on k.] Do step N3 for $k = 0$, 1, ..., $n - 1$, then terminate the
algorithm.

N3. [Scan row for dependence.] If there is some $j$ in the range $0 \le j < n$ such
that $a_{kj} \ne 0$ and $c_j < 0$, then do the following: Multiply column $j$ of $A$ by
$-1/a_{kj}$ (so that $a_{kj}$ becomes equal to $-1$); then add $a_{ki}$ times column $j$ to
column $i$ for all $i \ne j$; finally set $c_j \leftarrow k$. (Since it is not difficult to show
that $a_{sj} = 0$ for all $s < k$, these operations have no effect on rows 0, 1, ...,
$k - 1$ of $A$.)

On the other hand, if there is no $j$ in the range $0 \le j < n$ such that
$a_{kj} \ne 0$ and $c_j < 0$, then set $r \leftarrow r + 1$ and output the vector

$$v^{[r]} = (v_0, v_1, \ldots, v_{n-1})$$

defined by the rule

$$v_j = \begin{cases}
a_{ks}, & \text{if } c_s = j \ge 0;\\
1, & \text{if } j = k;\\
0, & \text{otherwise.}
\end{cases} \tag{18}$$ ▮
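
Steps N1-N3 can be sketched as follows for the field of integers modulo a prime $p$. This is a minimal illustration rather than a definitive implementation: the function name `null_space` is hypothetical, and the sketch always picks the smallest eligible $j$ in step N3, so the vectors it outputs may differ from those obtained with other (equally valid) choices of $j$.

```python
def null_space(A, p):
    """Algorithm N over the integers mod prime p: vectors v with v A = 0."""
    n = len(A)
    A = [[x % p for x in row] for row in A]       # work on a reduced copy
    c = [-1] * n                                  # N1: c_j = -1 for all j
    out = []
    for k in range(n):                            # N2: loop on k
        # N3: look for a column j with a_kj != 0 that has not been used yet
        j = next((j for j in range(n) if A[k][j] and c[j] < 0), None)
        if j is None:                             # row k depends on earlier rows
            v = [0] * n
            v[k] = 1
            for s in range(n):
                if c[s] >= 0:
                    v[c[s]] = A[k][s]             # rule (18)
            out.append(v)
            continue
        m = (-pow(A[k][j], p - 2, p)) % p         # -1/a_kj modulo p
        for row in A:
            row[j] = row[j] * m % p               # now a_kj = -1
        for i in range(n):
            f = A[k][i]
            if i != j and f:
                for row in A:                     # column i += a_ki * column j
                    row[i] = (row[i] + f * row[j]) % p
        c[j] = k
    return out

# A singular 3 x 3 example modulo 5: the third row is the sum of the first two.
EXAMPLE = null_space([[1, 2, 3], [0, 1, 4], [1, 3, 2]], 5)
```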
An example will reveal the mechanism of this algorithm. Let $A$ be the matrix
$Q - I$ of (17) over the field of integers modulo 13. When $k = 0$, we output the
vector $v^{[1]} = (1, 0, 0, 0, 0, 0, 0, 0)$. When $k = 1$, we may take $j$ in step N3 to be
either 0, 2, 3, 4, 5, 6, or 7; the choice here is completely arbitrary, although
it affects the particular vectors that are chosen to be output by the algorithm.
For hand calculation, it is most convenient to pick $j = 5$, since $a_{15} = 12 = -1$
already; the column operations of step N3 then change $A$ to the matrix

      0   0   0   0   0   0   0   0
      0   0   0   0   0   ⑫   0   0
     11   6   5   8   1   4   1   7
      3   3   9   5   9   6   6   4
      4  11   2   6  12   1   8   9
      5  11  11   7  10   6   1  10
      1  11   6   1   6  11   9   3
     12   3  11   9   6  11  12   2

(The circled element in column "5", row "1", is used here to indicate that
$c_5 = 1$. Remember that Algorithm N numbers the rows and columns of the
matrix starting with 0, not 1.) When $k = 2$, we may choose $j = 4$ and proceed
in a similar way, obtaining the following matrices, which all have the same null
space as $Q - I$:

                 k = 2                                 k = 3
      0   0   0   0   0   0   0   0         0   0   0   0   0   0   0   0
      0   0   0   0   0   ⑫   0   0         0   0   0   0   0   ⑫   0   0
      0   0   0   0   ⑫   0   0   0         0   0   0   0   ⑫   0   0   0
      8   1   3  11   4   9  10   6         0   ⑫   0   0   0   0   0   0
      2   4   7   1   1   5   9   3         9   9   8   9  11   8   8   5
     12   3   0   5   3   5   4   5         1  10   4  11   4   4   0   0
      0   1   2   5   7   0   3   0         5  12  12   7   3   4   6   7
     11   6   7   0   7   0   6  12         2   7   2  12   9  11  11   2

                 k = 4                                 k = 5
      0   0   0   0   0   0   0   0         0   0   0   0   0   0   0   0
      0   0   0   0   0   ⑫   0   0         0   0   0   0   0   ⑫   0   0
      0   0   0   0   ⑫   0   0   0         0   0   0   0   ⑫   0   0   0
      0   ⑫   0   0   0   0   0   0         0   ⑫   0   0   0   0   0   0
      0   0   0   0   0   0   0   ⑫         0   0   0   0   0   0   0   ⑫
      1  10   4  11   4   4   0   0         ⑫   0   0   0   0   0   0   0
      8   2   6  10  11  11   0   9         5   0   0   0   5   5   0   9
      1   6   4  11   2   0   0  10        12   9   0   0  11   9   0  10

Now every column that has no circled entry is completely zero; so when $k = 6$
and $k = 7$ the algorithm outputs two more vectors, namely

$$v^{[2]} = (0, 5, 5, 0, 9, 5, 1, 0), \qquad v^{[3]} = (0, 9, 11, 9, 10, 12, 0, 1).$$

From the form of matrix $A$ after $k = 5$, it is evident that these vectors satisfy
the equation $vA = (0, \ldots, 0)$. Since the computation has produced three linearly
independent vectors, $u(x)$ must have exactly three irreducible factors.

Finally we can go to step B4 of the factoring procedure. The calculation of
$\gcd(u(x), v^{[2]}(x) - s)$ for $0 \le s < 13$, where $v^{[2]}(x) = x^6 + 5x^5 + 9x^4 + 5x^2 + 5x$,
gives $x^5 + 5x^4 + 9x^3 + 5x + 5$ as the answer when $s = 0$, and $x^3 + 8x^2 + 4x + 12$
when $s = 2$; the gcd is unity for other values of $s$. Therefore $v^{[2]}(x)$ gives us only
two of the three factors. Turning to $\gcd(v^{[3]}(x) - s, x^5 + 5x^4 + 9x^3 + 5x + 5)$,
where $v^{[3]}(x) = x^7 + 12x^5 + 10x^4 + 9x^3 + 11x^2 + 9x$, we obtain the factor
$x^4 + 2x^3 + 3x^2 + 4x + 6$ when $s = 6$, $x + 3$ when $s = 8$, and unity otherwise. Thus the
complete factorization is

$$u(x) = (x^4 + 2x^3 + 3x^2 + 4x + 6)(x^3 + 8x^2 + 4x + 12)(x + 3). \tag{19}$$
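
The splitting step of this example can be replayed mechanically. In the sketch below (illustrative only; `split_with` and the helper names are hypothetical, and coefficient lists are stored with the constant term first) the two nontrivial null-space vectors quoted above are fed to a loop that refines the current list of factors with $\gcd(w(x), v(x) - s)$ for every $s$.

```python
def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a

def poly_divmod(a, b, p):
    """Quotient and remainder of a(x) by nonzero b(x) modulo prime p."""
    a = trim([c % p for c in a])
    inv = pow(b[-1], p - 2, p)
    q = [0] * max(len(a) - len(b) + 1, 0)
    while len(a) >= len(b):
        shift, c = len(a) - len(b), a[-1] * inv % p
        q[shift] = c
        for i, bc in enumerate(b):
            a[shift + i] = (a[shift + i] - c * bc) % p
        trim(a)
    return q, a

def poly_gcd(a, b, p):
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    while b:
        a, b = b, poly_divmod(a, b, p)[1]
    inv = pow(a[-1], p - 2, p) if a else 0
    return [c * inv % p for c in a]

def split_with(u, vs, p):
    """Step B4: refine [u] using gcd(w(x), v(x) - s) for each solution v of (8)."""
    factors = [list(u)]
    for v in vs:
        for s in range(p):
            shifted = [(v[0] - s) % p] + list(v[1:])
            refined = []
            for w in factors:
                g = poly_gcd(w, shifted, p)
                if 1 < len(g) < len(w):           # nontrivial common factor
                    refined += [g, poly_divmod(w, g, p)[0]]
                else:
                    refined.append(w)
            factors = refined
    return sorted(factors)

U = [8, 2, 8, 10, 10, 0, 1, 0, 1]                 # polynomial (15)
V2 = [0, 5, 5, 0, 9, 5, 1, 0]                     # v^[2] from the example
V3 = [0, 9, 11, 9, 10, 12, 0, 1]                  # v^[3] from the example
FACTORS = split_with(U, [V2, V3], 13)
```

The three lists in `FACTORS` are the coefficient vectors of $x + 3$, $x^4 + 2x^3 + 3x^2 + 4x + 6$, and $x^3 + 8x^2 + 4x + 12$, the factors shown in (19).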


Let us now estimate the running time of Berlekamp's method when an $n$th
degree polynomial is factored modulo $p$. First assume that $p$ is relatively small,
so that the four arithmetic operations can be done modulo $p$ in essentially a fixed
length of time. (Division modulo $p$ can be converted to multiplication, by storing
a table of reciprocals as suggested in exercise 9; for example, when working
modulo 13, we have $\frac12 = 7$, $\frac13 = 9$, etc.) The computation in step B1 takes $O(n^2)$
units of time; step B2 takes $O(pn^2)$. For step B3 we use Algorithm N, which
requires $O(n^3)$ units of time at most. Finally, in step B4 we can observe that
the calculation of $\gcd(f(x), g(x))$ by Euclid's algorithm takes $O(\deg(f)\deg(g))$
units of time; hence the calculation of $\gcd(v^{[j]}(x) - s, w(x))$ for fixed $j$ and $s$
and for all factors $w(x)$ of $u(x)$ found so far takes $O(n^2)$ units. Step B4 therefore
requires $O(prn^2)$ units of time at most. Berlekamp's procedure factors an
arbitrary polynomial of degree $n$, modulo $p$, in $O(n^3 + prn^2)$ steps, when $p$
is a small prime; and exercise 5 shows that the average number of factors, $r$, is
approximately $\ln n$. Thus the algorithm is much faster than any known methods
of factoring $n$-digit numbers in the $p$-ary number system.

Of course, when n and p are small, a trial-and-error factorization procedure 
analogous to Algorithm 4.5.4A will be even faster than Berlekamp’s method. 
Exercise 1 implies that it is a good idea to cast out factors of small degree first 
when p is small, before going to any more complicated procedure, even when n 
is large. 

When $p$ is large, a different implementation of Berlekamp's procedure would
be used for the calculations. Division modulo $p$ would not be done with an
auxiliary table of reciprocals; instead the method of exercise 4.5.2-16, which
takes $O((\log p)^2)$ steps, would probably be used. Then step B1 would take
$O(n^2(\log p)^2)$ units of time; similarly, step B3 would take $O(n^3(\log p)^2)$. In step
B2, we can form $x^p \bmod u(x)$ in a more efficient way than (16) when $p$ is large:
Section 4.6.3 shows that this value can be obtained by essentially using $O(\log p)$
operations of squaring mod $u(x)$, going from $x^k \bmod u(x)$ to $x^{2k} \bmod u(x)$,
together with the operation of multiplying by $x$. The squaring operation is
relatively easy to perform if we first make an auxiliary table of $x^m \bmod u(x)$ for
$m = n$, $n + 1$, ..., $2n - 2$; if $x^k \bmod u(x) = c_{n-1}x^{n-1} + \cdots + c_1x + c_0$, then

$$x^{2k} \bmod u(x) = \bigl(c_{n-1}^2x^{2n-2} + \cdots + (c_1c_0 + c_1c_0)x + c_0^2\bigr) \bmod u(x),$$

where $x^{2n-2}$, ..., $x^n$ can be replaced by polynomials in the auxiliary table. The
total time to compute $x^p \bmod u(x)$ comes to $O(n^2(\log p)^3)$ units, and we obtain
the second row of $Q$. To get further rows of $Q$, we can compute $x^{2p} \bmod u(x)$,
$x^{3p} \bmod u(x)$, ..., simply by multiplying repeatedly by $x^p \bmod u(x)$, in a fashion
analogous to squaring mod $u(x)$; step B2 is completed in $O(n^3(\log p)^2)$ additional
units of time. Thus steps B1, B2, and B3 take a total of $O(n^2(\log p)^3 + n^3(\log p)^2)$
time units; these three steps tell us the number of factors of $u(x)$.
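
A sketch of this large-$p$ strategy for step B2 follows. It is an illustration under stated assumptions, not the text's exact procedure: the names `mulmod`, `x_power_mod`, and `q_matrix` are hypothetical, $\deg(u) \ge 2$ is assumed, and each product is reduced directly instead of going through the auxiliary table of $x^m \bmod u(x)$ described above.

```python
def mulmod(a, b, u, p):
    """a(x) b(x) mod monic u(x), coefficients mod p, constant term first."""
    n = len(u) - 1
    r = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            r[i + j] = (r[i + j] + x * y) % p
    for k in range(len(r) - 1, n - 1, -1):        # replace x^k, k >= n
        c = r[k]
        for i in range(n + 1):
            r[k - n + i] = (r[k - n + i] - c * u[i]) % p
    return (r + [0] * n)[:n]

def x_power_mod(e, u, p):
    """x^e mod u(x) with O(log e) squarings and multiplications by x."""
    n = len(u) - 1
    result = [1] + [0] * (n - 1)
    x = [0, 1] + [0] * (n - 2)
    for bit in bin(e)[2:]:                        # binary digits, left to right
        result = mulmod(result, result, u, p)     # x^k -> x^(2k)
        if bit == "1":
            result = mulmod(result, x, u, p)      # x^(2k) -> x^(2k+1)
    return result

def q_matrix(u, p):
    """Rows x^0, x^p, x^(2p), ... obtained by repeated multiplication by x^p."""
    n = len(u) - 1
    xp = x_power_mod(p, u, p)
    rows = [[1] + [0] * (n - 1)]
    for _ in range(n - 1):
        rows.append(mulmod(rows[-1], xp, u, p))
    return rows

Q = q_matrix([8, 2, 8, 10, 10, 0, 1, 0, 1], 13)   # polynomial (15) modulo 13
```

For the small example the result agrees with the second row of $Q$ found by the shift-register method, $(2, 1, 7, 11, 10, 12, 5, 11)$.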

But when $p$ is large and we get to step B4, we are asked to calculate a greatest
common divisor for $p$ different values of $s$, and that is out of the question if $p$ is
even moderately large. This hurdle was first surmounted by Hans Zassenhaus
[J. Number Theory 1 (1969), 291-311], who showed how to determine all of the
"useful" values of $s$ (see exercise 14); but an even better way to proceed was
found by Zassenhaus and Cantor in 1980. If $v(x)$ is any solution to (8), we know
that $u(x)$ divides $v(x)^p - v(x) = v(x) \cdot (v(x)^{(p-1)/2} + 1) \cdot (v(x)^{(p-1)/2} - 1)$. This
suggests that we calculate

$$\gcd(u(x), v(x)^{(p-1)/2} - 1); \tag{20}$$

with a little bit of luck, (20) will be a nontrivial factor of $u(x)$. In fact, we can
determine exactly how much luck is involved, by considering (7). Let $v(x) \equiv s_j$
(modulo $p_j(x)$) for $1 \le j \le r$; then $p_j(x)$ divides $v(x)^{(p-1)/2} - 1$ if and only if
$s_j^{(p-1)/2} \equiv 1$ (modulo $p$). We know that exactly $(p-1)/2$ of the integers $s$ in the
range $0 \le s < p$ satisfy $s^{(p-1)/2} \equiv 1$ (modulo $p$), hence about half of the $p_j(x)$
will appear in the gcd (20). More precisely, if $v(x)$ is a random solution of (8),
where all $p^r$ solutions are equally likely, the probability that the gcd (20) equals
$u(x)$ is exactly

$$((p-1)/2p)^r,$$

and the probability that it equals 1 is $((p+1)/2p)^r$. The probability that a
nontrivial factor will be obtained is therefore

$$1 - \Bigl(\frac{p-1}{2p}\Bigr)^r - \Bigl(\frac{p+1}{2p}\Bigr)^r = 1 - \frac{1}{2^{r-1}}\Bigl(1 + \binom{r}{2}p^{-2} + \binom{r}{4}p^{-4} + \cdots\Bigr) \ge \frac49$$

for all $r \ge 2$ and $p \ge 3$.
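
The probability just derived can be checked with exact rational arithmetic. The sketch below is only an illustration (the names `split_probability` and `brute_force` are hypothetical): the first function evaluates the closed form, and the second counts residue tuples $(s_1, \ldots, s_r)$ directly for small $p$ and $r$.

```python
from fractions import Fraction
from itertools import product

def split_probability(p, r):
    """Closed form: 1 - ((p-1)/2p)^r - ((p+1)/2p)^r."""
    return 1 - Fraction(p - 1, 2 * p) ** r - Fraction(p + 1, 2 * p) ** r

def brute_force(p, r):
    """Fraction of tuples (s_1..s_r) for which gcd (20) is neither 1 nor u(x)."""
    good = 0
    for s in product(range(p), repeat=r):
        hits = sum(pow(x, (p - 1) // 2, p) == 1 for x in s)
        good += 0 < hits < r                      # some, but not all, factors appear
    return Fraction(good, p ** r)

worst = split_probability(3, 2)                   # the extreme case r = 2, p = 3
```

The extreme case $r = 2$, $p = 3$ gives exactly $\frac49$; larger $p$ or $r$ only improve the odds.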

It is therefore a good idea to replace step B4 by the following procedure,
unless $p$ is quite small: Set $v(x) \leftarrow a_1v^{[1]}(x) + a_2v^{[2]}(x) + \cdots + a_rv^{[r]}(x)$, where
the coefficients $a_j$ are randomly chosen in the range $0 \le a_j < p$. Let the current
partial factorization of $u(x)$ be $u_1(x) \ldots u_t(x)$ where $t$ is initially 1. Compute

$$g_i(x) = \gcd(u_i(x), v(x)^{(p-1)/2} - 1)$$

for all $i$ such that $\deg(u_i) > 1$; replace $u_i(x)$ by $g_i(x) \cdot (u_i(x)/g_i(x))$ and increase
the value of $t$, whenever a nontrivial gcd is found. Repeat this process for
different choices of $v(x)$ until $t = r$.
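
This randomized replacement for step B4 might be sketched as follows, for an odd prime $p$. The sketch is illustrative only: `random_split` and the helpers are hypothetical names, polynomials are coefficient lists with the constant term first, and the basis vectors are the ones quoted in the worked example modulo 13.

```python
import random

def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a

def poly_divmod(a, b, p):
    a = trim([c % p for c in a])
    inv = pow(b[-1], p - 2, p)
    q = [0] * max(len(a) - len(b) + 1, 0)
    while len(a) >= len(b):
        shift, c = len(a) - len(b), a[-1] * inv % p
        q[shift] = c
        for i, bc in enumerate(b):
            a[shift + i] = (a[shift + i] - c * bc) % p
        trim(a)
    return q, a

def poly_gcd(a, b, p):
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    while b:
        a, b = b, poly_divmod(a, b, p)[1]
    inv = pow(a[-1], p - 2, p) if a else 0
    return [c * inv % p for c in a]

def poly_mul(a, b, p):
    r = [0] * max(len(a) + len(b) - 1, 0)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            r[i + j] = (r[i + j] + x * y) % p
    return trim(r)

def poly_powmod(base, e, m, p):
    """base(x)^e mod m(x), by binary exponentiation."""
    result, base = [1], poly_divmod(base, m, p)[1]
    while e:
        if e & 1:
            result = poly_divmod(poly_mul(result, base, p), m, p)[1]
        base = poly_divmod(poly_mul(base, base, p), m, p)[1]
        e >>= 1
    return result

def random_split(u, basis, p, rng):
    """Split squarefree monic u into len(basis) factors using gcd (20)."""
    factors = [list(u)]
    while len(factors) < len(basis):
        a = [rng.randrange(p) for _ in basis]     # random a_1, ..., a_r
        v = [sum(ai * b[k] for ai, b in zip(a, basis)) % p for k in range(len(u) - 1)]
        refined = []
        for w in factors:
            g = w
            if len(w) > 2:                        # only when deg(u_i) > 1
                h = poly_powmod(v, (p - 1) // 2, w, p)
                h = [(h[0] - 1) % p] + h[1:] if h else [p - 1]
                g = poly_gcd(w, h, p)
            if 1 < len(g) < len(w):
                refined += [g, poly_divmod(w, g, p)[0]]
            else:
                refined.append(w)
        factors = refined
    return sorted(factors)

BASIS = [[1, 0, 0, 0, 0, 0, 0, 0],
         [0, 5, 5, 0, 9, 5, 1, 0],
         [0, 9, 11, 9, 10, 12, 0, 1]]
RESULT = random_split([8, 2, 8, 10, 10, 0, 1, 0, 1], BASIS, 13, random.Random(1))
```

The number of random trials varies with the seed, but the final list of factors does not.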

If we assume (as we may) that only $O(\log r)$ random solutions $v(x)$ to (8)
will be needed, we can give an upper bound on the time required to perform
this alternative to step B4. It takes $O(rn(\log p)^2)$ steps to compute $v(x)$; and if
$\deg(u_i) = d$, it takes $O(d^2(\log p)^3)$ steps to compute $v(x)^{(p-1)/2} \bmod u_i(x)$ and
$O(d^2(\log p)^2)$ further steps to compute $\gcd(u_i(x), v(x)^{(p-1)/2} - 1)$. Thus the
total time is $O(n^2(\log p)^3 \log r)$.


Distinct-degree factorization. We shall now turn to a somewhat simpler way 
to find factors modulo p. The ideas we have studied so far in this section involve 
many instructive insights into computational algebra, so the author does not 
apologize to the reader for presenting them; but it turns out that the problem 
of factorization modulo p can actually be solved without relying on so many 
concepts. 

In the first place we can make use of the fact that an irreducible polynomial
$q(x)$ of degree $d$ is a divisor of $x^{p^d} - x$, and it is not a divisor of $x^{p^c} - x$ for
$1 \le c < d$; see exercise 16. We can therefore cast out the irreducible factors of
each degree separately, by adopting the following strategy.


D1. [Go squarefree.] Rule out squared factors, as in Berlekamp's method. Also
set $v(x) \leftarrow u(x)$, $w(x) \leftarrow$ "$x$", and $d \leftarrow 0$. (Here $v(x)$ and $w(x)$ are variables
that have polynomials as values.)

D2. [If not done, take $p$th power.] (At this point $w(x) = x^{p^d} \bmod v(x)$; all of
the irreducible factors of $v(x)$ are distinct and have degree $> d$.) If $d + 1 >
\frac12\deg(v)$, the procedure terminates since we either have $v(x) = 1$ or $v(x)$ is
irreducible. Otherwise increase $d$ by 1 and replace $w(x)$ by $w(x)^p \bmod v(x)$.

D3. [Extract factors.] Find $g_d(x) = \gcd(w(x) - x, v(x))$. (This is the product
of all the irreducible factors of $u(x)$ whose degree is $d$.) If $g_d(x) \ne 1$, replace
$v(x)$ by $v(x)/g_d(x)$ and $w(x)$ by $w(x) \bmod v(x)$; and if the degree of $g_d(x)$
is greater than $d$, use the algorithm below to find its factors. Return to
step D2. ▮
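
Steps D2 and D3 can be sketched as follows for a polynomial that is already monic and squarefree (step D1 is assumed to have been done). This is an illustration, not a definitive implementation: `distinct_degree` and the helpers are hypothetical names, and coefficient lists are stored with the constant term first.

```python
def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a

def poly_divmod(a, b, p):
    a = trim([c % p for c in a])
    inv = pow(b[-1], p - 2, p)
    q = [0] * max(len(a) - len(b) + 1, 0)
    while len(a) >= len(b):
        shift, c = len(a) - len(b), a[-1] * inv % p
        q[shift] = c
        for i, bc in enumerate(b):
            a[shift + i] = (a[shift + i] - c * bc) % p
        trim(a)
    return q, a

def poly_gcd(a, b, p):
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    while b:
        a, b = b, poly_divmod(a, b, p)[1]
    inv = pow(a[-1], p - 2, p) if a else 0
    return [c * inv % p for c in a]

def poly_mul(a, b, p):
    r = [0] * max(len(a) + len(b) - 1, 0)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            r[i + j] = (r[i + j] + x * y) % p
    return trim(r)

def poly_powmod(base, e, m, p):
    result, base = [1], poly_divmod(base, m, p)[1]
    while e:
        if e & 1:
            result = poly_divmod(poly_mul(result, base, p), m, p)[1]
        base = poly_divmod(poly_mul(base, base, p), m, p)[1]
        e >>= 1
    return result

def distinct_degree(u, p):
    """Return {d: g_d(x)}, the product of the degree-d irreducible factors of u."""
    v, w, d, out = list(u), [0, 1], 0, {}         # w(x) = x
    while 2 * (d + 1) <= len(v) - 1:              # D2: stop when d+1 > deg(v)/2
        d += 1
        w = poly_powmod(w, p, v, p)               # w = x^(p^d) mod v
        w_minus_x = list(w) + [0] * (2 - len(w))
        w_minus_x[1] = (w_minus_x[1] - 1) % p
        g = poly_gcd(v, trim(w_minus_x), p)       # D3
        if len(g) > 1:
            out[d] = g
            v = poly_divmod(v, g, p)[0]
            w = poly_divmod(w, v, p)[1]
    if len(v) > 1:                                # what remains is irreducible
        out[len(v) - 1] = v
    return out

BY_DEGREE = distinct_degree([8, 2, 8, 10, 10, 0, 1, 0, 1], 13)   # polynomial (15)
```

For the example polynomial the dictionary has one entry each for degrees 1, 3, and 4, so no further splitting is needed.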


This procedure determines the product of all irreducible factors of each
degree $d$, and therefore it tells us how many factors there are of each degree.
Since the three factors of our example polynomial (19) have different degrees,
they would all be discovered without any need to factorize the polynomials $g_d(x)$.

To complete the method, we need a way to split the polynomial $g_d(x)$ into
its irreducible factors when $\deg(g_d) > d$. Michael Rabin pointed out in 1976
that this can be done by doing arithmetic in the field of $p^d$ elements. David G.
Cantor and Hans Zassenhaus discovered in 1979 that there is an even simpler
way to proceed, based on the following identity: If $p$ is any odd prime, we have

$$g_d(x) = \gcd(g_d(x), t(x))\,\gcd(g_d(x), t(x)^{(p^d-1)/2} + 1)\,\gcd(g_d(x), t(x)^{(p^d-1)/2} - 1) \tag{21}$$

for all polynomials $t(x)$, since $t(x)^{p^d} - t(x)$ is a multiple of all irreducible
polynomials of degree $d$. (We may regard $t(x)$ as an element of the field of size $p^d$,
when that field consists of all polynomials modulo an irreducible $f(x)$ as in
exercise 16.) Now exercise 29 shows that $\gcd(g_d(x), t(x)^{(p^d-1)/2} - 1)$ will be a
nontrivial factor of $g_d(x)$ about 50 percent of the time, when $t(x)$ is a random
polynomial of degree $\le 2d - 1$; hence we will not need many random trials
to discover all of the factors. We may assume without loss of generality that
$t(x)$ is monic, since integer multiples of $t(x)$ make no difference except possibly
to change $t(x)^{(p^d-1)/2}$ into its negative. Thus in the case $d = 1$, we can take
$t(x) = x + s$, where $s$ is chosen at random.
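
Identity (21) leads to a short recursive splitting routine for odd $p$. The sketch below is an illustration under stated assumptions: `equal_degree_split` and the helpers are hypothetical names, the input is assumed to be monic and squarefree with all irreducible factors of degree $d$, and $t(x)$ is drawn at random with degree at most $2d - 1$ without being normalized to be monic.

```python
import random

def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a

def poly_divmod(a, b, p):
    a = trim([c % p for c in a])
    inv = pow(b[-1], p - 2, p)
    q = [0] * max(len(a) - len(b) + 1, 0)
    while len(a) >= len(b):
        shift, c = len(a) - len(b), a[-1] * inv % p
        q[shift] = c
        for i, bc in enumerate(b):
            a[shift + i] = (a[shift + i] - c * bc) % p
        trim(a)
    return q, a

def poly_gcd(a, b, p):
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    while b:
        a, b = b, poly_divmod(a, b, p)[1]
    inv = pow(a[-1], p - 2, p) if a else 0
    return [c * inv % p for c in a]

def poly_mul(a, b, p):
    r = [0] * max(len(a) + len(b) - 1, 0)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            r[i + j] = (r[i + j] + x * y) % p
    return trim(r)

def poly_powmod(base, e, m, p):
    result, base = [1], poly_divmod(base, m, p)[1]
    while e:
        if e & 1:
            result = poly_divmod(poly_mul(result, base, p), m, p)[1]
        base = poly_divmod(poly_mul(base, base, p), m, p)[1]
        e >>= 1
    return result

def equal_degree_split(g, d, p, rng):
    """Irreducible factors of g, all assumed to have degree d (p an odd prime)."""
    if len(g) - 1 == d:
        return [list(g)]
    while True:
        t = [rng.randrange(p) for _ in range(2 * d)]      # deg(t) <= 2d - 1
        h = poly_powmod(t, (p ** d - 1) // 2, g, p)
        h = [(h[0] - 1) % p] + h[1:] if h else [p - 1]    # t^((p^d-1)/2) - 1
        f = poly_gcd(g, h, p)
        if 1 < len(f) < len(g):                           # nontrivial factor found
            rest = poly_divmod(g, f, p)[0]
            return sorted(equal_degree_split(f, d, p, rng)
                          + equal_degree_split(rest, d, p, rng))

# (x^2 + 1)(x^2 + x + 2) = x^4 + x^3 + x + 2 modulo 3: two factors of degree 2.
QUADRATICS = equal_degree_split([2, 1, 0, 1, 1], 2, 3, random.Random(0))
# (x + 3)(x + 5)(x + 7) = x^3 + 2x^2 + 6x + 1 modulo 13: three linear factors.
LINEARS = equal_degree_split([1, 6, 2, 1], 1, 13, random.Random(0))
```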

Sometimes this procedure will in fact succeed for $d > 1$ when only linear
polynomials $t(x)$ are used. For example, there are eight irreducible polynomials
$f(x)$ of degree 3, modulo 3, and they will all be distinguished by calculating
$\gcd(f(x), (x + s)^{13} - 1)$ for $0 \le s < 3$:

    f(x)                     s = 0    s = 1    s = 2
    x^3 + 2x + 1               1        1        1
    x^3 + 2x + 2              f(x)     f(x)     f(x)
    x^3 + x^2 + 2             f(x)     f(x)      1
    x^3 + x^2 + x + 2         f(x)      1       f(x)
    x^3 + x^2 + 2x + 1         1       f(x)     f(x)
    x^3 + 2x^2 + 1             1       f(x)      1
    x^3 + 2x^2 + x + 1         1        1       f(x)
    x^3 + 2x^2 + 2x + 2       f(x)      1        1

Exercise 31 contains a partial explanation of why linear polynomials can be
effective. But when there are more than $2^p$ irreducible polynomials of degree $d$, some
irreducibles must exist that cannot be distinguished by linear choices of $t(x)$.
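
The table for the eight irreducible cubics modulo 3 can be regenerated with a few lines of code. The sketch is illustrative only (`mulmod`, `power_is_one`, `CUBICS`, and `TABLE` are hypothetical names). Because each $f(x)$ is irreducible, $\gcd(f(x), (x+s)^{13} - 1)$ is $f(x)$ exactly when $(x+s)^{13} \bmod f(x) = 1$, so the sketch tests that condition instead of computing a gcd.

```python
def mulmod(a, b, f, p):
    """a(x) b(x) mod monic f(x), coefficients mod p, constant term first."""
    n = len(f) - 1
    r = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            r[i + j] = (r[i + j] + x * y) % p
    for k in range(len(r) - 1, n - 1, -1):
        c = r[k]
        for i in range(n + 1):
            r[k - n + i] = (r[k - n + i] - c * f[i]) % p
    return (r + [0] * n)[:n]

def power_is_one(f, s, p=3, e=13):
    """True when (x + s)^e mod f(x) equals 1, so that the gcd in the table is f(x)."""
    n = len(f) - 1
    base = [s % p, 1] + [0] * (n - 2)
    r = [1] + [0] * (n - 1)
    for _ in range(e):
        r = mulmod(r, base, f, p)
    return r == [1] + [0] * (n - 1)

# The eight monic irreducible cubics modulo 3, in the order of the table.
CUBICS = [[1, 2, 0, 1], [2, 2, 0, 1], [2, 0, 1, 1], [2, 1, 1, 1],
          [1, 2, 1, 1], [1, 0, 2, 1], [1, 1, 2, 1], [2, 2, 2, 1]]
TABLE = [[power_is_one(f, s) for s in range(3)] for f in CUBICS]
```

Each row of `TABLE` is a different pattern of three truth values, which is why the three linear choices of $t(x)$ suffice to tell all eight polynomials apart.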

An alternative to (21) that works when $p = 2$ is discussed in exercise 30.
Faster algorithms for distinct-degree factorization when $p$ is very large have been
found by J. von zur Gathen, V. Shoup, and E. Kaltofen; the running time is
$O(n^{2+\epsilon} + n^{1+\epsilon}\log p)$ arithmetic operations modulo $p$ for numbers of practical
size, and $O(n^{(5+\omega+\epsilon)/4}\log p)$ such operations as $n \to \infty$, when $\omega$ is the exponent
of "fast" matrix multiplication in exercise 4.6.4-66. [See Computational
Complexity 2 (1992), 187-224; J. Symbolic Comp. 20 (1995), 363-397; Math. Comp.
67 (1998), 1179-1197.]


Historical notes: The idea of finding all the linear factors of a squarefree
polynomial $f(x)$ modulo $p$ by first calculating $g(x) = \gcd(x^{p-1} - 1, f(x))$ and
then calculating $\gcd(g(x), (x + s)^{(p-1)/2} \pm 1)$ for arbitrary $s$ is due to A. M.
Legendre, Mémoires Acad. Sci. (Paris, 1785), 484-490; his motive was to find
all of the integer solutions to Diophantine equations of the form $f(x) = py$,
that is, $f(x) \equiv 0$ (modulo $p$). The more general degree-separation technique
embodied in Algorithm D was discovered by C. F. Gauss before 1800, but not
published [see his Werke 2 (1876), 237], and then by Évariste Galois in the
now-classic paper that launched the theory of finite fields [Bulletin des Sciences 
Mathématiques, Physiques et Chimiques 13 (1830), 428-435; reprinted in J. de 
Math. Pures et Appliquées 11 (1846), 398-407]. However, this work of Gauss 
and Galois was ahead of its time, and not well understood until J. A. Serret gave 
a detailed exposition somewhat later [Mémoires Acad. Sci., series 2, 35 (Paris, 
1866), 617-688; Algorithm D is in §7]. Special procedures for splitting ga(x) into 
irreducible factors were devised subsequently by various authors, but methods 
of full generality that would work efficiently for large p were apparently not 
discovered until the advent of computers made them desirable. The first such 
randomized algorithm with a rigorously analyzed running time was published by 
E. Berlekamp [Math. Comp. 24 (1970), 713-735]; it was refined and simplified 
by Robert T. Moenck [Math. Comp. 31 (1977), 235-250], M. O. Rabin [SICOMP
9 (1980), 273-280], D. G. Cantor and H. J. Zassenhaus [Math. Comp. 36 (1981),
587-592]. Paul Camion independently found a generalization to special classes
of multivariate polynomials [Comptes Rendus Acad. Sci. A291 (Paris, 1980), 
479-482; IEEE Trans. IT-29 (1983), 378-385]. 

The average number of operations needed to factor a random polynomial 
mod p has been analyzed by P. Flajolet, X. Gourdon, and D. Panario, Lecture 
Notes in Comp. Sci. 1099 (1996), 232-243. 


Factoring over the integers. It is somewhat more difficult to find the complete 
factorization of polynomials with integer coefficients when we are not working 
modulo p, but some reasonably efficient methods are available for this purpose. 

Isaac Newton gave a method for finding linear and quadratic factors of
polynomials with integer coefficients in his Arithmetica Universalis (1707). His
method was extended by N. Bernoulli in 1708 and, more explicitly, by an
astronomer named Friedrich von Schubert in 1793, who showed how to find all
factors of degree $n$ in a finite number of steps; see M. Mignotte and D. Stefanescu,
Revue d'Hist. Math. 7 (2001), 67-89. L. Kronecker rediscovered their approach
independently, about 90 years later; but unfortunately the method is very
inefficient when $n$ is five or more. Much better results can be obtained with the help
of the "mod p" factorization methods presented above.

Suppose that we want to find the irreducible factors of a given polynomial

$$u(x) = u_nx^n + u_{n-1}x^{n-1} + \cdots + u_0, \qquad u_n \ne 0,$$

over the integers. As a first step, we can divide by the greatest common divisor of
the coefficients; this leaves us with a primitive polynomial. We may also assume
that $u(x)$ is squarefree, by dividing out $\gcd(u(x), u'(x))$ as in exercise 34.

Now if $u(x) = v(x)w(x)$, where each of these polynomials has integer
coefficients, we obviously have $u(x) \equiv v(x)w(x)$ (modulo $p$) for all primes $p$, so there
is a nontrivial factorization modulo $p$ unless $p$ divides $\ell(u)$. An efficient algorithm
for factoring $u(x)$ modulo $p$ can therefore be used in an attempt to reconstruct
possible factorizations of $u(x)$ over the integers.

For example, let

$$u(x) = x^8 + x^6 - 3x^4 - 3x^3 + 8x^2 + 2x - 5. \tag{22}$$

We have seen above in (19) that

$$u(x) \equiv (x^4 + 2x^3 + 3x^2 + 4x + 6)(x^3 + 8x^2 + 4x + 12)(x + 3) \ (\text{modulo } 13); \tag{23}$$

and the complete factorization of $u(x)$ modulo 2 shows one factor of degree 6
and another of degree 2 (see exercise 10). From (23) we can see that $u(x)$ has
no factor of degree 2, so it must be irreducible over the integers.
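
The degree argument used here is a small subset-sum computation: any factor over the integers must have a degree that is a sum of factor degrees modulo every prime considered. The sketch below (illustrative only; `subset_sums` is a hypothetical name) intersects the possibilities for the degree patterns quoted above, namely 4, 3, 1 modulo 13 and 6, 2 modulo 2.

```python
def subset_sums(degrees):
    """All degrees obtainable by multiplying together some of the modular factors."""
    sums = {0}
    for d in degrees:
        sums |= {s + d for s in sums}
    return sums

mod13 = subset_sums([4, 3, 1])        # degrees of the factors in (23)
mod2 = subset_sums([6, 2])            # degrees of the factors modulo 2
possible = mod13 & mod2               # degrees a factor over the integers could have
```

Only the degrees 0 and 8 survive, so every factorization of (22) over the integers is trivial.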

This particular example was perhaps too simple; experience shows that most 
irreducible polynomials can be recognized as such by examining their factors 
modulo a few primes, but it is not always so easy to establish irreducibility. For 
example, there are polynomials that can be properly factored modulo p for all 
primes p, with consistent degrees of the factors, yet they are irreducible over the 
integers (see exercise 12). 

A large family of irreducible polynomials is exhibited in exercise 38, and 
exercise 27 proves that almost all polynomials are irreducible over the integers. 
But we usually aren’t trying to factor a random polynomial; there is probably 
some reason to expect a nontrivial factor or else the calculation would not have 
been attempted in the first place. We need a method that identifies factors when 
they are there. 

In general if we try to find the factors of u(x) by considering its behavior 
modulo different primes, the results will not be easy to combine. For example, if 
u(x) is actually the product of four quadratic polynomials, we will have trouble 
matching up their images with respect to different prime moduli. Therefore it is 
desirable to stick to a single prime and to see how much mileage we can get out 
of it, once we feel that the factors modulo this prime have the right degrees. 

One idea is to work modulo a very large prime p, big enough so that the 
coefficients in any true factorization u(x) = v(x) w(x) over the integers must 
actually lie between −p/2 and p/2. Then all possible integer factors can be 
read off from the factors that we know how to compute mod p. 

Exercise 20 shows how to obtain fairly good bounds on the coefficients of 
polynomial factors. For example, if (22) were reducible it would have a factor 
v(x) of degree ≤ 4, and the coefficients of v would be at most 34 in magnitude 
by the results of that exercise. So all potential factors of u(x) will be fairly 
evident if we work modulo any prime p > 68. Indeed, the complete factorization 
modulo 71 is 


(x + 12)(x + 25)(x^2 − 13x − 7)(x^4 − 24x^3 − 16x^2 + 31x − 12), 


and we see immediately that none of these polynomials could be a factor of (22) 
over the integers since the constant terms do not divide 5; furthermore there is 
no way to obtain a divisor of (22) by grouping two of these factors, since none 
of the conceivable constant terms 12 × 25, 12 × (−7), 12 × (−12) is congruent to 
±1 or ±5 (modulo 71). 
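
The two numerical claims of this paragraph can be reproduced with a short sketch. It assumes the coefficient bound of exercise 20(d), namely |v_j| ≤ C(m−1, j)||u|| + C(m−1, j−1)|u_n| for a factor of degree m, and it tests the constant terms of every proper subset of the factors modulo 71 (slightly more than the text needs). The function names are hypothetical.

```python
from itertools import combinations
from math import comb, sqrt

U = [1, 0, 1, 0, -3, -3, 8, 2, -5]          # u(x) from (22), highest degree first
FACTORS_MOD_71 = [[1, 12], [1, 25], [1, -13, -7], [1, -24, -16, 31, -12]]


def coefficient_bound(u, m):
    """Integer part of the exercise 20(d) bound for a degree-m factor (m >= 1)."""
    norm = sqrt(sum(c * c for c in u))
    lead = abs(u[0])
    best = 0.0
    for j in range(m + 1):
        lower = comb(m - 1, j - 1) if j >= 1 else 0
        best = max(best, comb(m - 1, j) * norm + lower * lead)
    return int(best)


def symmetric(c, m):
    """Representative of c modulo m in the balanced range around zero."""
    return (c + m // 2) % m - m // 2


def suspicious_subsets(factors, m, allowed):
    """Proper subsets whose constant-term product is congruent to an allowed value."""
    hits = []
    for size in range(1, len(factors)):
        for combo in combinations(factors, size):
            const = 1
            for f in combo:
                const = const * f[-1] % m
            if symmetric(const, m) in allowed:
                hits.append(combo)
    return hits


if __name__ == "__main__":
    print(coefficient_bound(U, 4))
    print(suspicious_subsets(FACTORS_MOD_71, 71, {1, -1, 5, -5}))
```

An empty list from `suspicious_subsets` means that no grouping of the factors modulo 71 has a constant term that could divide 5, which is the argument used in the text.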

Incidentally, it is not trivial to obtain good bounds on the coefficients of 
polynomial factors, since a lot of cancellation can occur when polynomials are 
multiplied. For example, the innocuous-looking polynomial x^n − 1 has irreducible 
factors whose coefficients exceed exp(n^{1/lg lg n}) for infinitely many n. [See R. C. 
Vaughan, Michigan Math. J. 21 (1974), 289-295.] The factorization of x^n − 1 is 
discussed in exercise 32. 

Instead of using a large prime p, which might need to be truly enormous if 
u(x) has large degree or large coefficients, we can also make use of small p, pro- 
vided that u(x) is squarefree mod p. For in this case, an important construction 
known as Hensel’s Lemma can be used to extend a factorization modulo p in 
a unique way to a factorization modulo p^e for arbitrarily high exponents e (see 
exercise 22). If we apply Hensel’s Lemma to (23) with p = 13 and e = 2, we 
obtain the unique factorization 


u(x) ≡ (x − 36)(x^3 − 18x^2 + 82x − 66)(x^4 + 54x^3 − 10x^2 + 69x + 84) 


(modulo 169). Calling these factors v_1(x) v_3(x) v_4(x), we see that v_1(x) and v_3(x) 
are not factors of u(x) over the integers, nor is their product v_1(x)v_3(x) when 
the coefficients have been reduced modulo 169 to the range (−169/2 .. 169/2). Thus 
we have exhausted all possibilities, proving once again that u(x) is irreducible 
over the integers — this time using only its factorization modulo 13. 
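
As a hedged illustration of this argument, the sketch below takes the three factors modulo 169 as given (it does not perform the Hensel lifting itself), confirms that their product is congruent to u(x), and then trial-divides u(x) over the integers by v_1(x), v_3(x), and the balanced residue of v_1(x)v_3(x). All helper names are assumptions made for the example.

```python
U = [1, 0, 1, 0, -3, -3, 8, 2, -5]            # u(x) from (22), highest degree first
V1, V3, V4 = [1, -36], [1, -18, 82, -66], [1, 54, -10, 69, 84]
M = 169


def mul(a, b):
    """Product of two integer polynomials."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def symmetric(c, m):
    """Representative of c modulo m in the balanced range around zero."""
    return (c + m // 2) % m - m // 2


def divides(d, u):
    """True if d divides u exactly over the integers."""
    rem = list(u)
    while len(rem) >= len(d):
        if rem[0] % d[0]:
            return False
        q = rem[0] // d[0]
        for i, di in enumerate(d):
            rem[i] -= q * di
        rem.pop(0)                              # leading coefficient is now zero
    return not any(rem)


def lifted_product_ok():
    """Check u(x) = v1(x) v3(x) v4(x) (modulo 169)."""
    prod = mul(mul(V1, V3), V4)
    return [c % M for c in prod] == [c % M for c in U]


# The three trial divisors considered in the text.
CANDIDATES = [V1, V3, [symmetric(c, M) for c in mul(V1, V3)]]

if __name__ == "__main__":
    print(lifted_product_ok())
    print([divides(c, U) for c in CANDIDATES])
```

Because v_4(x) is the complement of v_1(x)v_3(x), and the complements of v_1(x) and v_3(x) have degree greater than 4, these three trial divisions cover every possible proper factor.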

The example we have been considering is atypical in one important respect: 
We have been factoring the monic polynomial u(x) in (22), so we could assume 
that all its factors were monic. What should we do if u_n > 1? In such a case, the 
leading coefficients of all but one of the polynomial factors can be varied almost 
arbitrarily modulo p^e; we certainly don’t want to try all possibilities. Perhaps the 
reader has already noticed this problem. Fortunately there is a simple way out: 
The factorization u(x) = v(x) w(x) implies a factorization u_n u(x) = v_1(x) w_1(x) 
where ℓ(v_1) = ℓ(w_1) = u_n = ℓ(u). (“Excuse me, do you mind if I multiply 
your polynomial by its leading coefficient before I factor it?”) We can proceed 
essentially as above, but using p^e > 2B where B now bounds the maximum 
coefficient for factors of u_n u(x) instead of u(x). Another way to solve the leading 
coefficient problem is discussed in exercise 40. 
Putting these observations all together results in the following procedure: 


F1. [Factor modulo a prime power.] Find the unique squarefree factorization 
u(x) ≡ ℓ(u) v_1(x) ... v_r(x) (modulo p^e), 

where p^e is sufficiently large as explained above, and where the v_j(x) are 
monic. (This will be possible for all but a few primes p; see exercise 23.) 
Also set d ← 1. 

F2. [Try the d-element subfactors.] For every combination of factors v(x) = 
v_{i_1}(x) ... v_{i_d}(x), with i_1 = 1 if d = r/2, form the unique polynomial v̄(x) ≡ 
ℓ(u)v(x) (modulo p^e) whose coefficients all lie in the interval [−p^e/2 .. p^e/2). 
If v̄(x) divides ℓ(u)u(x), output the factor pp(v̄(x)), divide u(x) by this 
factor, and remove the corresponding v_i(x) from the list of factors modulo p^e; 
decrease r by the number of factors removed, and terminate if d > r/2. 

F3. [Loop on d.] Increase d by 1, and return to F2 if d ≤ r/2. ∎ 
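
Steps F2 and F3 can be sketched as follows. This is a minimal illustration rather than a tuned implementation: it assumes that step F1 has already supplied the monic factors modulo p^e, that u(x) is primitive and squarefree, and that p^e exceeds twice the relevant coefficient bound. The names `combine`, `exact_div`, and `primitive_part` are hypothetical, and polynomials are coefficient lists with the highest degree first.

```python
from functools import reduce
from itertools import combinations
from math import gcd


def symmetric(c, m):
    """Representative of c modulo m in the balanced range around zero."""
    return (c + m // 2) % m - m // 2


def mul(a, b):
    """Product of two integer polynomials."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def exact_div(u, d):
    """Quotient u/d over the integers, or None if d does not divide u exactly."""
    rem, quo = list(u), []
    while len(rem) >= len(d):
        if rem[0] % d[0]:
            return None
        q = rem[0] // d[0]
        quo.append(q)
        for i, di in enumerate(d):
            rem[i] -= q * di
        rem.pop(0)
    return None if any(rem) else quo


def primitive_part(v):
    """Divide out the content of v and make the leading coefficient positive."""
    g = reduce(gcd, v)
    if v[0] < 0:
        g = -g
    return [c // g for c in v]


def combine(u, mod_factors, pe):
    """Steps F2-F3: return the irreducible factors of u over the integers."""
    u, factors, found, d = list(u), [list(f) for f in mod_factors], [], 1
    while 2 * d <= len(factors):
        hit = None
        for combo in combinations(range(len(factors)), d):
            if 2 * d == len(factors) and combo[0] != 0:
                continue                      # the complementary choice covers this case
            vbar = [u[0]]                     # start from the leading coefficient l(u)
            for i in combo:
                vbar = [c % pe for c in mul(vbar, factors[i])]
            vbar = [symmetric(c, pe) for c in vbar]
            if exact_div([u[0] * c for c in u], vbar) is not None:
                hit = (combo, primitive_part(vbar))
                break
        if hit is None:
            d += 1                            # step F3
        else:
            combo, f = hit
            found.append(f)
            u = exact_div(u, f)
            factors = [g for i, g in enumerate(factors) if i not in combo]
    found.append(u)                           # what remains is the last factor
    return found


if __name__ == "__main__":
    # (x^2 + 1)(x^2 + 2), with x^2 + 1 = (x + 7)(x + 18) modulo 25.
    print(combine([1, 0, 3, 0, 2], [[1, 7], [1, 18], [1, 0, 2]], 25))
```

The sketch restarts the search at the same d after each success, which is one reasonable reading of step F2; a production version would avoid retesting combinations that have already failed.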


At the conclusion of this process, the current value of u(x) will be the final 
irreducible factor of the originally given polynomial. Notice that if |u_0| < |u_n|, 
it is preferable to do all of the work with the reverse polynomial u_0 x^n + ... + u_n, 
whose factors are the reverses of the factors of u(x). 

The procedure as stated requires p^e > 2B, where B is a bound on the 
coefficients of any divisor of u_n u(x), but we can use a much smaller value of B 
if we only guarantee it to be valid for divisors of degree ≤ ½ deg(u). In this case 
the divisibility test in step F2 should be applied to w(x) = v_1(x) ... v_r(x)/v(x) 
instead of v(x), whenever deg(v) > ½ deg(u). 

We can decrease B still more if we decide to guarantee only that B should 
bound the coefficients of at least one proper divisor of u(x). (For example, 
when we're factoring a nonprime integer N instead of a polynomial, some of the 
divisors might be very large, but at least one will be ≤ √N.) This idea, due 
to B. Beauzamy, V. Trevisan, and P. S. Wang [J. Symbolic Comp. 15 (1993), 
393-413], is discussed in exercise 21. The divisibility test in step F2 must then 
be applied to both v(x) and w(x), but the computations are faster because p^e is 
often much smaller. 

The algorithm above contains an obvious bottleneck: We may have to test 
as many as 2^{r−1} − 1 potential factors v(x). The average value of 2^r in a random 
situation is about n, or perhaps n^{1.5} (see exercise 5), but in nonrandom situations 
we will want to speed up this part of the routine as much as we can. One way 
to rule out spurious factors quickly is to compute the trailing coefficient v̄(0) 
first, continuing only if this divides ℓ(u)u(0); the complications explained in 
the preceding paragraphs do not have to be considered unless this divisibility 
condition is satisfied, since such a test is valid even when deg(v) > ½ deg(u). 

Another important way to speed up the procedure is to reduce r so that 
it tends to reflect the true number of factors. The distinct degree factorization 
algorithm above can be applied for various small primes p_j, thus obtaining for 
each prime a set D_j of possible degrees of factors modulo p_j; see exercise 26. We 
can represent D_j as a string of n binary bits. Now we compute the intersection 
∩ D_j, namely the bitwise “and” of these strings, and we perform step F2 only for 

deg(v_{i_1}) + ... + deg(v_{i_d}) ∈ ∩ D_j. 

Furthermore p is chosen to be that p_j having the smallest value of r. This 
technique is due to David R. Musser, whose experience suggests trying about five 
primes p_j [see JACM 25 (1978), 271-282]. Of course we would stop immediately 
if the current ∩ D_j shows that u(x) is irreducible. 
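
The degree sets lend themselves to a compact bit-string sketch. Here each D_j is held in a Python integer whose bit k is 1 when some subset of the factor degrees sums to k (so bits 0 and n are always set); the function name is illustrative. The example uses the degrees reported in the text for (22): 4, 3, 1 modulo 13 and 6, 2 modulo 2.

```python
def degree_set(degrees):
    """Bit k of the result is 1 iff some subset of `degrees` sums to k."""
    mask = 1                       # the empty product has degree 0
    for d in degrees:
        mask |= mask << d          # either skip this factor or include it
    return mask


def proper_degrees(mask, n):
    """Degrees strictly between 0 and n that remain possible."""
    return [k for k in range(1, n) if (mask >> k) & 1]


if __name__ == "__main__":
    n = 8
    common = degree_set([4, 3, 1]) & degree_set([6, 2])
    print(proper_degrees(common, n))
```

An empty list of proper degrees is exactly the situation in which we can stop at once, since no proper factor over the integers is compatible with both primes.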

Musser has given a complete discussion of a factorization method similar 
to the steps above, in JACM 22 (1975), 291-308. Steps F1-F3 incorporate an 
improvement suggested in 1978 by G. E. Collins, namely to look for trial divisors 
by taking combinations of d factors at a time rather than combinations of total 
degree d. This improvement is important because of the statistical behavior of 
the modulo-p factors of polynomials that are irreducible over the rationals (see 
exercise 37). 

A. K. Lenstra, H. W. Lenstra, Jr., and L. Lovász introduced their famous 
“LLL algorithm” in order to obtain rigorous worst-case bounds on the amount of 
computation needed to factor a polynomial over the integers [Math. Annalen 261 
(1982), 515-534]. Their method requires no random numbers, and its running 
time for u(x) of degree n is O(n^{12} + n^9 (log ||u||)^3) bit operations, where ||u|| is 
defined in exercise 20. This estimate includes the time to search for a suitable 
prime number p and to find all factors modulo p with Algorithm B. Of course, 
heuristic methods that use randomization run noticeably faster in practice. 


Greatest common divisors. Similar techniques can be used to calculate 
greatest common divisors of polynomials: If gcd(u(x),v(x)) = d(x) over the 
integers, and if gcd(u(x),v(x)) = q(x) (modulo p) where q(x) is monic, then 
d(x) is a common divisor of u(x) and v(x) modulo p; hence 


d(x) divides q(x) (modulo p). (24) 


If p does not divide the leading coefficients of both u and v, it does not divide the 
leading coefficient of d; in such a case deg(d) ≤ deg(q). When q(x) = 1 for such 
a prime p, we must therefore have deg(d) = 0, and d(x) = gcd(cont(u), cont(v)). 
This justifies the remark made in Section 4.6.1 that the simple computation of 
gcd (u(x), v(x)) modulo 13 in 4.6.1-(6) is enough to prove that u(x) and v(x) are 
relatively prime over the integers; the comparatively laborious calculations of 
Algorithm 4.6.1E or Algorithm 4.6.1C are unnecessary. Since two random prim- 
itive polynomials are almost always relatively prime over the integers, and since 
they are relatively prime modulo p with probability 1 — 1/p by exercise 4.6.1-5, 
it is usually a good idea to do the computations modulo p. 
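
A minimal sketch of Euclid's algorithm modulo a prime illustrates the point. The function name and the small test polynomials are assumptions made for the example (they are not the polynomials of Section 4.6.1), and `pow(x, -1, p)` requires Python 3.8 or later.

```python
def gcd_mod_p(u, v, p):
    """Monic gcd of u and v modulo the prime p (coefficients highest degree first)."""
    def trim(a):
        a = [c % p for c in a]
        while a and a[0] == 0:
            a.pop(0)
        return a

    u, v = trim(u), trim(v)
    while v:
        inv = pow(v[0], -1, p)             # inverse of the leading coefficient
        r = list(u)
        while len(r) >= len(v):            # one polynomial division step
            q = r[0] * inv % p
            for i, vi in enumerate(v):
                r[i] = (r[i] - q * vi) % p
            r.pop(0)
        u, v = v, trim(r)
    inv = pow(u[0], -1, p)
    return [c * inv % p for c in u]


if __name__ == "__main__":
    # x^2 + 1 and x + 1 are relatively prime over the integers.
    print(gcd_mod_p([1, 0, 1], [1, 1], 3))   # a prime that proves it
    print(gcd_mod_p([1, 0, 1], [1, 1], 2))   # a prime that does not
```

Modulo 3 the result is the constant 1, which by (24) settles the question over the integers; modulo 2 the gcd has positive degree, so that prime tells us nothing.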

As remarked before, we need good methods also for the nonrandom poly- 
nomials that arise in practice. Therefore we wish to sharpen our techniques and 
discover how to find gcd(u(x), v(x)) in general, over the integers, based entirely 
on information that we obtain working modulo primes p. We may assume that 
u(x) and v(x) are primitive. 

Instead of calculating gcd(u(x), v(x)) directly, it will be convenient to search 
instead for the polynomial 


d̄(x) = c · gcd(u(x), v(x)), (25) 
where the constant c is chosen so that 
ℓ(d̄) = gcd(ℓ(u), ℓ(v)). (26) 


This condition will always hold for suitable c, since the leading coefficient of any 
common divisor of u(x) and v(x) must be a divisor of gcd(ℓ(u), ℓ(v)). Once d̄(x) 
has been found satisfying these conditions, we can readily compute pp(d̄(x)), 
which is the true greatest common divisor of u(x) and v(x). Condition (26) 
is convenient since it avoids the uncertainty of unit multiples of the gcd; we 
have used essentially the same idea to control the leading coefficients in our 
factorization routine. 

If p is a sufficiently large prime, based on the bounds for coefficients in 
exercise 20 applied either to ℓ(d̄)u(x) or ℓ(d̄)v(x), let us compute the unique 
polynomial q̄(x) ≡ ℓ(d̄)q(x) (modulo p) having all coefficients in [−p/2 .. p/2). 
When pp(q̄(x)) divides both u(x) and v(x), it must equal gcd(u(x), v(x)) because 
of (24). On the other hand if it does not divide both u(x) and v(x) we must 
have deg(q) > deg(d). A study of Algorithm 4.6.1E reveals that this will be the 
case only if p divides the leading coefficient of one of the nonzero remainders 
computed by that algorithm with exact integer arithmetic; otherwise Euclid’s 
algorithm modulo p deals with precisely the same sequence of polynomials as 
Algorithm 4.6.1E except for nonzero constant multiples (modulo p). So only a 
small number of “unlucky” primes can cause us to miss the gcd, and we will soon 
find a lucky prime if we keep trying. 
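
The single-prime procedure just described might be sketched as follows, under the assumptions that u(x) and v(x) are primitive, that p is an odd prime larger than twice the coefficient bound, and that p does not divide either leading coefficient. The names `modular_gcd`, `gcd_mod_p`, and `divides` are hypothetical; a result of `None` signals an unlucky prime.

```python
from functools import reduce
from math import gcd


def gcd_mod_p(u, v, p):
    """Monic gcd of u and v modulo the prime p (coefficients highest degree first)."""
    def trim(a):
        a = [c % p for c in a]
        while a and a[0] == 0:
            a.pop(0)
        return a

    u, v = trim(u), trim(v)
    while v:
        inv = pow(v[0], -1, p)
        r = list(u)
        while len(r) >= len(v):
            q = r[0] * inv % p
            for i, vi in enumerate(v):
                r[i] = (r[i] - q * vi) % p
            r.pop(0)
        u, v = v, trim(r)
    inv = pow(u[0], -1, p)
    return [c * inv % p for c in u]


def divides(d, u):
    """True if d divides u exactly over the integers."""
    rem = list(u)
    while len(rem) >= len(d):
        if rem[0] % d[0]:
            return False
        q = rem[0] // d[0]
        for i, di in enumerate(d):
            rem[i] -= q * di
        rem.pop(0)
    return not any(rem)


def modular_gcd(u, v, p):
    """Try to recover gcd(u, v) over the integers from the single prime p."""
    q = gcd_mod_p(u, v, p)
    lead = gcd(u[0], v[0])                        # l(d-bar) as in (26)
    half = p // 2
    qbar = [(lead * c + half) % p - half for c in q]
    g = reduce(gcd, qbar)
    if qbar[0] < 0:
        g = -g
    cand = [c // g for c in qbar]                 # pp(q-bar)
    return cand if divides(cand, u) and divides(cand, v) else None


if __name__ == "__main__":
    # (2x + 1)(x + 3) and (2x + 1)(x - 4)
    print(modular_gcd([2, 7, 3], [2, -7, -4], 101))
    # x^2 + 1 and x^2 - 4 coincide modulo 5, an unlucky prime
    print(modular_gcd([1, 0, 1], [1, 0, -4], 5))
```

When `None` is returned, the natural response is simply to try another prime, as the text suggests.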

If the bound on coefficients is so large that single-precision primes p are 
insufficient, we can compute d̄(x) modulo several primes p until it has been 
determined via the Chinese remainder algorithm of Section 4.3.2. This approach, 
which is due to W. S. Brown and G. E. Collins, has been described in detail by 
Brown in JACM 18 (1971), 478-504. Alternatively, as suggested by J. Moses and 
D. Y. Y. Yun [Proc. ACM Conf. 28 (1973), 159-166], we can use Hensel’s method 
to determine d̄(x) modulo p^e for sufficiently large e. Hensel’s construction 
appears to be computationally superior to the Chinese remainder approach; but 
it is valid directly only when 


d(x) ⊥ u(x)/d(x) or d(x) ⊥ v(x)/d(x), (27) 


since the idea is to apply the techniques of exercise 22 to one of the factorizations 
ℓ(d̄)u(x) ≡ q̄(x)u_1(x) or ℓ(d̄)v(x) ≡ q̄(x)v_1(x) (modulo p). Exercises 34 and 35 
show that it is possible to arrange things so that (27) holds whenever necessary. 
(The notation 


u(x) ⊥ v(x) (28) 


used in (27) means that u(x) and v(x) are relatively prime, by analogy with the 
notation used for relatively prime integers.) 

The gcd algorithms sketched here are significantly faster than those of Section 4.6.1 
except when the polynomial remainder sequence is very short. Perhaps 
the best general procedure would be to start with the computation of 
gcd(u(x), v(x)) modulo a fairly small prime p, not a divisor of both ℓ(u) and ℓ(v). 
If the result q(x) is 1, we’re done; if it has high degree, we use Algorithm 4.6.1C; 
otherwise we use one of the methods above, first computing a bound for the 
coefficients of d(x) based on the coefficients of u(x) and v(x), and on the (small) 
degree of q(x). As in the factorization problem, we should apply this procedure 
to the reverses of u(x), v(x) and reverse the result, if the trailing coefficients are 
simpler than the leading ones. 


Multivariate polynomials. Similar techniques lead to useful algorithms for 
factorization or gcd calculations on multivariate polynomials with integer coefficients. 
It is convenient to deal with the polynomial u(x_1, ..., x_t) by working 
modulo the irreducible polynomials x_2 − a_2, ..., x_t − a_t, which play the role of p 
in the discussion above. Since v(x) mod (x − a) = v(a), the value of 


u(x_1, ..., x_t) mod {x_2 − a_2, ..., x_t − a_t} 


is the univariate polynomial u(x_1, a_2, ..., a_t). When the integers a_2, ..., a_t are 
chosen so that u(x_1, a_2, ..., a_t) has the same degree in x_1 as the original polynomial 
u(x_1, x_2, ..., x_t), an appropriate generalization of Hensel’s construction 
will “lift” squarefree factorizations of this univariate polynomial to factorizations 
modulo {(x_2 − a_2)^{n_2+1}, ..., (x_t − a_t)^{n_t+1}}, where n_j is the degree of x_j in u; at the 
same time we can also work modulo an appropriate integer prime p. As many as 
possible of the a_j should be zero, so that sparseness of the intermediate results 
is retained. For details, see P. S. Wang, Math. Comp. 32 (1978), 1215-1231, in 
addition to the papers by Musser and by Moses and Yun cited earlier. 
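
The reduction modulo x_2 − a_2 is just substitution, as a small sketch for the bivariate case shows. The dictionary representation {(i, j): c} for the term c·x_1^i·x_2^j, the function names, and the sample polynomial are all assumptions made for illustration. The second function checks the degree condition that must hold before any lifting is attempted.

```python
def reduce_mod_linear(u, a2):
    """Image of u(x1, x2) modulo x2 - a2, that is, u(x1, a2) as {degree: coeff}."""
    image = {}
    for (i, j), c in u.items():
        image[i] = image.get(i, 0) + c * a2 ** j
    return {i: c for i, c in image.items() if c != 0}


def degree_preserved(u, a2):
    """True if u(x1, a2) has the same degree in x1 as u(x1, x2)."""
    full = max(i for (i, _j), c in u.items() if c != 0)
    image = reduce_mod_linear(u, a2)
    return bool(image) and max(image) == full


# Hypothetical example: u(x1, x2) = x2*x1^2 + x1 + 1.
U2 = {(2, 1): 1, (1, 0): 1, (0, 0): 1}

if __name__ == "__main__":
    print(reduce_mod_linear(U2, 0), degree_preserved(U2, 0))
    print(reduce_mod_linear(U2, 3), degree_preserved(U2, 3))
```

In this example the preferred choice a_2 = 0 loses the leading term, so a nonzero a_2 is needed; this is the tension between sparseness and the degree condition mentioned above.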

Significant computational experience has been accumulating since the days 
when the pioneering papers cited above were written. See R. E. Zippel, Effective 
Polynomial Computation (Boston: Kluwer, 1993) for a more recent survey. More- 
over, it is now possible to factor polynomials that are given implicitly by a “black 
box” computational procedure, even when both input and output polynomials 
would fill the universe if they were written out explicitly [see E. Kaltofen and 
B. M. Trager, J. Symbolic Comp. 9 (1990), 301-320; Y. N. Lakshman and 
B. David Saunders, SICOMP 24 (1995), 387-397]. 


The asymptotically best algorithms frequently turn out 
to be worst on all problems for which they are used. 


— D. G. CANTOR and H. ZASSENHAUS (1981) 


EXERCISES 


1. [M24] Let p be prime, and let u(x) be a random polynomial of degree n, assuming 
that each of the p^n monic polynomials is equally likely. Show that if n ≥ 2, the 
probability that u(x) has a linear factor mod p lies between (1 + p^{−1})/2 and (2 + p^{−2})/3, 
inclusive. Give a closed form for this probability when n ≥ p. What is the average 
number of linear factors? 

> 2. [M25] (a) Show that any monic polynomial u(x), over a unique factorization 
domain, may be expressed uniquely in the form 


u(x) = v(x)^2 w(x), 


where w(x) is squarefree (has no factor of positive degree of the form d(x)^2) and both 
v(x) and w(x) are monic. (b) (E. R. Berlekamp.) How many monic polynomials of 
degree n are squarefree modulo p, when p is prime? 


3. [M25] (The Chinese remainder theorem for polynomials.) Let u_1(x), ..., u_r(x) be 
polynomials over a field S, with u_j(x) ⊥ u_k(x) for all j ≠ k. For any given polynomials 
w_1(x), ..., w_r(x) over S, prove that there is a unique polynomial v(x) over S such 
that deg(v) < deg(u_1) + ... + deg(u_r) and v(x) ≡ w_j(x) (modulo u_j(x)) for 1 ≤ j ≤ r. 
Does this result hold also when S is the set of all integers? 

4. [HM28] Let a_{np} be the number of monic irreducible polynomials of degree n, 
modulo a prime p. Find a formula for the generating function G_p(z) = Σ_n a_{np} z^n. 
[Hint: Prove the following identity connecting power series: f(z) = Σ_{j≥1} g(z^j)/j^t if 
and only if g(z) = Σ_{n≥1} μ(n) f(z^n)/n^t.] What is lim_{p→∞} a_{np}/p^n? 

5. [HM30] Let A_{np} be the average number of irreducible factors of a randomly 
selected polynomial of degree n, modulo a prime p. Show that lim_{p→∞} A_{np} = H_n. 
What is the limiting average value of 2^r, when r is the number of irreducible factors? 

6. [M21] (J. L. Lagrange, 1771.) Prove the congruence (9). [Hint: Factor x^p − x in 
the field of p elements.] 


7. [M22] Prove Eq. (14). 


8. [HM20] How can we be sure that the vectors output by Algorithm N are linearly 
independent? 


9. [20] Explain how to construct a table of reciprocals mod 101 in a simple way, 
given that 2 is a primitive root of 101. 


> 10. [21] Find the complete factorization of the polynomial u(x) in (22), modulo 2, 
using Berlekamp’s procedure. 


11. [22] Find the complete factorization of the polynomial u(x) in (22), modulo 5. 


> 12. [M22] Use Berlekamp’s algorithm to determine the number of factors of u(x) = 
x^4 + 1, modulo p, for all primes p. [Hint: Consider the cases p = 2, p = 8k + 1, 
p = 8k +3, p = 8k +5, p = 8k +7 separately; what is the matrix Q? You need not 
discover the factors; just determine how many there are.] 


13. [M25] Continuing the previous exercise, give an explicit formula for the factors 
of x^4 + 1, modulo p, for all odd primes p, in terms of the quantities √−1, √2, √−2 
when such square roots exist modulo p. 


14. [M25] (H. Zassenhaus.) Let v(x) be a solution to (8), and let w(x) = ∏ (x − s) 
where the product is over all 0 ≤ s < p such that gcd(u(x), v(x) − s) ≠ 1. Explain 
how to compute w(x), given u(x) and v(x). [Hint: Eq. (14) implies that w(x) is the 
polynomial of least degree such that u(x) divides w(v(x)).] 


> 15. [M27] (Square roots modulo a prime.) Design an algorithm to calculate the square 
root of a given integer u modulo a given prime p, that is, to find an integer v such that 
v^2 ≡ u (modulo p) whenever such a v exists. Your algorithm should be efficient even 
for very large primes p. (For p ≠ 2, a solution to this problem leads to a procedure for 
solving any given quadratic equation modulo p, using the quadratic formula in the usual 
way.) Hint: Consider what happens when the factorization methods of this section are 
applied to the polynomial x^2 − u. 


16. [M30] (Finite fields.) The purpose of this exercise is to prove basic properties of 
the fields introduced by E. Galois in 1830. 

a) Given that f(x) is an irreducible polynomial modulo a prime p, of degree n, prove 
that the p^n polynomials of degree less than n form a field under arithmetic modulo 
f(x) and p. [Note: The existence of irreducible polynomials of each degree is 
proved in exercise 4; therefore fields with p^n elements exist for all primes p and all 
n ≥ 1.] 

b) Show that any field with p^n elements has a “primitive root” element ξ such that 
the elements of the field are {0, 1, ξ, ξ^2, ..., ξ^{p^n−2}}. [Hint: Exercise 3.2.1.2-16 
provides a proof in the special case n = 1.] 

c) If f(x) is an irreducible polynomial modulo p, of degree n, prove that x^{p^m} − x is 
divisible by f(x) if and only if m is a multiple of n. (It follows that we can test 
irreducibility rather quickly: A given nth degree polynomial f(x) is irreducible 
modulo p if and only if x^{p^n} − x is divisible by f(x) and x^{p^{n/q}} − x ⊥ f(x) for all 
primes q that divide n.) 


17. [M23] Let F be a field with 13^2 elements. How many elements of F have order f, 
for each integer f with 1 ≤ f < 13^2? (The order of an element a is the least positive 
integer m such that a^m = 1.) 


18. [M25] Let u(x) = u_n x^n + ... + u_0, u_n ≠ 0, be a primitive polynomial with integer 
coefficients, and let v(x) be the monic polynomial defined by 


v(x) = u_n^{n−1} · u(x/u_n) = x^n + u_{n−1} x^{n−1} + u_{n−2} u_n x^{n−2} + ... + u_0 u_n^{n−1}. 

(a) Given that v(x) has the complete factorization p_1(x) ... p_r(x) over the integers, 
where each p_j(x) is monic, what is the complete factorization of u(x) over the integers? 
(b) If w(x) = x^m + w_{m−1} x^{m−1} + ... + w_0 is a factor of v(x), prove that w_k is a multiple 
of u_n^{m−1−k} for 0 ≤ k < m. 


19. [M20] (Eisenstein’s criterion.) Perhaps the best-known class of irreducible poly- 
nomials over the integers was introduced by T. Schönemann in Crelle 32 (1846), 100, 
then popularized by G. Eisenstein in Crelle 39 (1850), 166-169: Let p be prime and 
let u(x) = u_n x^n + ... + u_0 have the following properties: (i) u_n is not divisible by p; 
(ii) u_{n−1}, ..., u_0 are divisible by p; (iii) u_0 is not divisible by p^2. Show that u(x) is 
irreducible over the integers. 


20. [HM33] If u(x) = u_n x^n + ... + u_0 is any polynomial over the complex numbers, 
let ||u|| = (|u_n|^2 + ... + |u_0|^2)^{1/2}. 
a) Let u(x) = (x − α)w(x) and v(x) = (ᾱx − 1)w(x), where α is any complex number 
and ᾱ is its complex conjugate. Prove that ||u|| = ||v||. 
b) Let u_n(x − α_1) ... (x − α_n) be the complete factorization of u(x) over the complex 
numbers, and write M(u) = |u_n| ∏_{j=1}^{n} max(1, |α_j|). Prove that M(u) ≤ ||u||. 
c) Show that |u_j| ≤ C(n−1, j) M(u) + C(n−1, j−1) |u_n|, for 0 ≤ j ≤ n. (Here C(a, b) 
denotes the binomial coefficient “a choose b”.) 


d) Combine these results to prove that if u(x) = v(x) w(x) and v(x) = v_m x^m + ... + v_0, 
where u, v, w all have integer coefficients, then the coefficients of v are bounded by 


|v_j| ≤ C(m−1, j) ||u|| + C(m−1, j−1) |u_n|. 


21. [HM32] Continuing exercise 20, we can also derive useful bounds on the coefficients 
of multivariate polynomial factors over the integers. For convenience we will let 
boldface letters stand for sequences of t integers; thus, instead of writing 


u(x_1, ..., x_t) = Σ_{j_1, ..., j_t} u_{j_1 ... j_t} x_1^{j_1} ... x_t^{j_t} 


we will write simply u(x) = Σ_j u_j x^j. Notice the convention for x^j; we also write 
j! = j_1! ... j_t! and Σj = j_1 + ... + j_t. 
a) Prove the identity 


Σ_{j,k≥0} (1/(j! k!)) (Σ_{p,q≥0} [p − j = q − k] a_p b_q p! q!/(p − j)!) (Σ_{r,s≥0} [r − j = s − k] c_r d_s r! s!/(r − j)!) 
    = Σ_{i≥0} i! (Σ_{p,s≥0} [p + s = i] a_p d_s) (Σ_{q,r≥0} [q + r = i] b_q c_r). 


b) The polynomial u(x) = Σ_j u_j x^j is called homogeneous of degree n if each term has 
total degree n; thus we have Σj = n whenever u_j ≠ 0. Consider the weighted sum 
of coefficients B(u) = Σ_j j! |u_j|^2. Use part (a) to show that B(u) ≥ B(v)B(w) 
whenever u(x) = v(x) w(x) is homogeneous. 

c) The Bombieri norm [u] of a polynomial u(x) is defined to be √(B(u)/n!) when u 
is homogeneous of degree n. It is also defined for nonhomogeneous polynomials, 
by adding a new variable x_{t+1} and multiplying each term by a power of x_{t+1} 
so that u becomes homogeneous without increasing its maximum degree. For 
example, let u(x) = 4x^3 + x − 2; the corresponding homogeneous polynomial is 
4x^3 + xy^2 − 2y^3, and we have [u]^2 = (3! 0! 4^2 + 1! 2! 1^2 + 0! 3! 2^2)/3! = 16 + 1/3 + 4. 
If u(x, y, z) = 3xy^3 − z^2 we have, similarly, [u]^2 = (1! 3! 0! 0! 3^2 + 0! 0! 2! 2! 1^2)/4! = 
9/4 + 1/6. What does part (b) tell us about the relation between [u], [v], and [w], 
when u(x) = v(x) w(x)? 

d) Prove that if u(x) is a reducible polynomial of degree n in one variable, it has a 
factor whose coefficients are at most n!^{1/4} [u]^{1/2}/(n/4)! in absolute value. What is 
the corresponding result for homogeneous polynomials in t variables? 

e) Calculate [u] both explicitly and asymptotically when u(x) = (x^2 − 1)^n. 

f) Prove that [u][v] ≥ [uv]. 

g) Show that 2^{−n/2} M(u) ≤ [u] ≤ 2^{n/2} M(u), when u(x) is a polynomial of degree n 
and M(u) is the quantity defined in exercise 20. (Therefore the bound in part (d) 
is roughly the square root of the bound we obtained in that exercise.) 


> 22. [M24] (Hensel’s Lemma.) Let u(x), v_e(x), w_e(x), a(x), b(x) be polynomials with 
integer coefficients, satisfying the relations 


u(x) ≡ v_e(x) w_e(x) (modulo p^e),    a(x) v_e(x) + b(x) w_e(x) ≡ 1 (modulo p), 


where p is prime, e ≥ 1, v_e(x) is monic, deg(a) < deg(w_e), deg(b) < deg(v_e), and 
deg(u) = deg(v_e) + deg(w_e). Show how to compute polynomials v_{e+1}(x) ≡ v_e(x) and 
w_{e+1}(x) ≡ w_e(x) (modulo p^e), satisfying the same conditions with e increased by 1. 
Furthermore, prove that v_{e+1}(x) and w_{e+1}(x) are unique, modulo p^{e+1}. 

Use your method for p = 2 to prove that (22) is irreducible over the integers, 
starting with its factorization modulo 2 found in exercise 10. (Note that Euclid’s 
extended algorithm, exercise 4.6.1-3, will get the process started for e = 1.) 

23. [HM23] Let u(x) be a squarefree polynomial with integer coefficients. Prove that 
there are only finitely many primes p such that u(x) is not squarefree modulo p. 


24. [M20] The text speaks only of factorization over the integers, not over the field 
of rational numbers. Explain how to find the complete factorization of a polynomial 
with rational coefficients, over the field of rational numbers. 


25. [M25] What is the complete factorization of x^5 + x^4 + x^2 + x + 2 over the field of 
rational numbers? 

26. [20] Let d_1, ..., d_r be the degrees of the irreducible factors of u(x) modulo p, 
with proper multiplicity, so that d_1 + ... + d_r = n = deg(u). Explain how to compute 
the set {deg(v) | u(x) = v(x) w(x) (modulo p) for some v(x), w(x)} by performing O(r) 
operations on binary bit strings of length n. 


27. [HM30] Prove that a random primitive polynomial over the integers is “almost 
always” irreducible, in some appropriate sense. 


28. [M25] The distinct-degree factorization procedure is “lucky” when there is at 
most one irreducible polynomial of each degree d; then g_d(x) never needs to be broken 
into factors. What is the probability of such a lucky circumstance, when factoring a 
random polynomial of degree n, modulo p, for fixed n as p → ∞? 


29. [M22] Let g(x) be a product of two or more distinct irreducible polynomials of 
degree d, modulo an odd prime p. Prove that gcd(g(x), t(x)^{(p^d−1)/2} − 1) will be a 
proper factor of g(x) with probability ≥ 1/2 − 1/(2p^{2d}), for any fixed g(x), when t(x) is 
selected at random from among the p^{2d} polynomials of degree < 2d modulo p. 


30. [M25] Prove that if q(x) is an irreducible polynomial of degree d, modulo p, and if 
t(x) is any polynomial, then the value of (t(x) + t(x)^p + t(x)^{p^2} + ... + t(x)^{p^{d−1}}) mod q(x) 
is an integer (that is, a polynomial of degree ≤ 0). Use this fact to design a randomized 
algorithm for factoring a product g_d(x) of degree-d irreducibles, analogous to (21), for 
the case p = 2. 


31. [HM30] Let p be an odd prime and let d ≥ 1. Show that there exists a number 
n(p, d) having the following two properties: (i) For all integers t, exactly n(p, d) 
irreducible polynomials q(x) of degree d, modulo p, satisfy (x + t)^{(p^d−1)/2} mod q(x) = 1. 
(ii) For all integers 0 ≤ t_1 < t_2 < p, exactly n(p, d) irreducible polynomials q(x) of 
degree d, modulo p, satisfy (x + t_1)^{(p^d−1)/2} mod q(x) = (x + t_2)^{(p^d−1)/2} mod q(x). 


32. [M30] (Cyclotomic polynomials.) Let Ψ_n(x) = ∏_{1≤k≤n, k⊥n} (x − ω^k), where ω = 
e^{2πi/n}; thus, the roots of Ψ_n(x) are the complex nth roots of unity that aren’t mth 
roots for m < n. 


a) Prove that Ψ_n(x) is a polynomial with integer coefficients, and that 


x^n − 1 = ∏_{d\n} Ψ_d(x);    Ψ_n(x) = ∏_{d\n} (x^d − 1)^{μ(n/d)}. 


(See exercises 4.5.2-10(b) and 4.5.3-28(c).) 

b) Prove that Ψ_n(x) is irreducible over the integers, hence the formula above is the 
complete factorization of x^n − 1 over the integers. [Hint: If f(x) is an irreducible 
factor of Ψ_n(x) over the integers, and if ζ is a complex number with f(ζ) = 0, 
prove that f(ζ^p) = 0 for all primes p not dividing n. It may help to use the fact 
that x^n − 1 is squarefree modulo p for all such primes.] 

c) Discuss the calculation of Ψ_n(x), and tabulate the values for n ≤ 15. 

33. [M18] True or false: If u(x) ≠ 0 and the complete factorization of u(x) modulo p 
is p_1(x)^{e_1} ... p_r(x)^{e_r}, then u(x)/gcd(u(x), u′(x)) = p_1(x) ... p_r(x). 

34. [M25] (Squarefree factorization.) It is clear that any primitive polynomial of a 
unique factorization domain can be expressed in the form u(x) = u_1(x) u_2(x)^2 u_3(x)^3 ..., 
where the polynomials u_j(x) are squarefree and relatively prime to each other. This 
representation, in which u_j(x) is the product of all the irreducible polynomials that 
divide u(x) exactly j times, is unique except for unit multiples; and it is a useful way to 
represent polynomials that participate in multiplication, division, and gcd operations. 

Let GCD(u(x), v(x)) be a procedure that returns three answers: 


GCD(u(x), v(x)) = (d(x), u(x)/d(x), v(x)/d(x)), where d(x) = gcd(u(x), v(x)). 


The modular method described in the text following Eq. (25) always ends with a trial 
division of u(x)/d(x) and v(x)/d(x), to make sure that no “unlucky prime” has been 
used, so the quantities u(x)/d(x) and v(x)/d(x) are byproducts of the gcd computation; 
thus we can compute GCD(u(x), v(x)) essentially as fast as gcd(u(x), v(x)) when we 
are using a modular method. 

Devise a procedure that obtains the squarefree representation (u_1(x), u_2(x), ...) 
of a given primitive polynomial u(x) over the integers. Your algorithm should perform 
exactly e computations of a GCD, where e is the largest subscript with u_e(x) ≠ 1; 
furthermore, each GCD calculation should satisfy (27), so that Hensel’s construction 
can be used. 


35. [M22] (D. Y. Y. Yun.) Design an algorithm that computes the squarefree representation 
(w_1(x), w_2(x), ...) of w(x) = gcd(u(x), v(x)) over the integers, given the 
squarefree representations (u_1(x), u_2(x), ...) and (v_1(x), v_2(x), ...) of u(x) and v(x). 

36. [M27] Extend the procedure of exercise 34 so that it will obtain the squarefree 
representation (u_1(x), u_2(x), ...) of a given polynomial u(x) when the coefficient arithmetic 
is performed modulo p. 


37. [HM24] (George E. Collins.) Let d_1, ..., d_r be positive integers whose sum is n, 
and let p be prime. What is the probability that the irreducible factors of a random 
nth-degree integer polynomial u(x) have degrees d_1, ..., d_r, when it is completely factored 
modulo p? Show that this probability is asymptotically the same as the probability 
that a random permutation on n elements has cycles of lengths d_1, ..., d_r. 


38. [HM27] (Perron’s criterion.) Let u(x) = x^n + u_{n−1} x^{n−1} + ... + u_0 be a polynomial 
with integer coefficients such that u_0 ≠ 0 and either |u_{n−1}| > 1 + |u_{n−2}| + ... + |u_0| or 
(u_{n−1} = 0 and u_{n−2} > 1 + |u_{n−3}| + ... + |u_0|). Show that u(x) is irreducible over the 
integers. [Hint: Prove that almost all of u’s roots are less than 1 in absolute value.] 


39. [HM42] (David G. Cantor.) Show that if the polynomial $u(x)$ is irreducible over
the integers, it has a “succinct” proof of irreducibility, in the sense that the number of
bits in the proof is at most a polynomial in $\deg(u)$ and the length of the coefficients.
(Only a bound on the length of proof is requested here, as in exercise 4.5.4-17, not a
bound on the time needed to find such a proof.) Hint: If $v(x)$ is irreducible and $t$ is
any polynomial over the integers, all factors of $v(t(x))$ have degree $\ge \deg(v)$. Perron’s
criterion gives a large supply of irreducible polynomials $v(x)$.

40. [M20] (P. S. Wang.) If $u_n$ is the leading coefficient of $u(x)$ and $B$ is a bound on
coefficients of some factor of $u$, the text’s factorization algorithm requires us to find a
factorization modulo $p^e$ where $p^e > 2|u_n|B$. But $|u_n|$ might be larger than $B$, when
$B$ is chosen by the method of exercise 21. Show that if $u(x)$ is reducible, there is a way
to recover one of its true factors from a factorization modulo $p^e$ whenever $p^e > 2B^2$,
by using the algorithm of exercise 4.5.3-51.

41. [M47] (Beauzamy, Trevisan, and Wang.) Prove or disprove: There is a constant c 
such that, if f(x) is any integer polynomial with all coefficients < B in absolute value, 
then one of its irreducible factors has coefficients bounded by cB. 


4.6.3. Evaluation of Powers 


In this section we shall study the interesting problem of computing $x^n$ efficiently,
given $x$ and $n$, where $n$ is a positive integer. Suppose, for example, that we need
to compute $x^{16}$; we could simply start with $x$ and multiply by $x$ fifteen times.
But it is possible to obtain the same answer with only four multiplications, if
we repeatedly take the square of each partial result, successively forming $x^2$, $x^4$,
$x^8$, $x^{16}$.

The same idea applies, in general, to any value of $n$, in the following way:
Write $n$ in the binary number system (suppressing zeros at the left). Then
replace each “1” by the pair of letters SX, replace each “0” by S, and cross off
the “SX” that now appears at the left. The result is a rule for computing $x^n$, if
“S” is interpreted as the operation of squaring, and if “X” is interpreted as the
operation of multiplying by $x$. For example, if $n = 23$, its binary representation
is 10111; so we form the sequence SX S SX SX SX and remove the leading SX
to obtain the rule SSXSXSX. This rule states that we should “square, square,
multiply by $x$, square, multiply by $x$, square, and multiply by $x$”; in other words,
we should successively compute $x^2$, $x^4$, $x^5$, $x^{10}$, $x^{11}$, $x^{22}$, $x^{23}$.
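
The S-and-X rule can be sketched in a few lines of Python. This is only an illustration of the rule as described above; the function names `sx_rule` and `power_by_rule` are invented for this sketch and do not come from the text.

```python
def sx_rule(n):
    """Return the S/X rule for a positive integer n, e.g. 23 -> 'SSXSXSX'."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    bits = bin(n)[2:]  # binary digits of n, zeros at the left suppressed
    rule = "".join("SX" if b == "1" else "S" for b in bits)
    return rule[2:]    # cross off the "SX" that appears at the left


def power_by_rule(x, n):
    """Evaluate x**n by reading the rule from left to right."""
    y = x
    for op in sx_rule(n):
        y = y * y if op == "S" else y * x  # S = square, X = multiply by x
    return y
```

For instance, `sx_rule(23)` yields `"SSXSXSX"`, and `power_by_rule(3, 23)` follows the exponents 2, 4, 5, 10, 11, 22, 23.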

This binary method is easily justified by a consideration of the sequence of
exponents in the calculation: If we reinterpret “S” as the operation of multiplying
by 2 and “X” as the operation of adding 1, and if we start with 1 instead of $x$,
the rule will lead to a computation of $n$ because of the properties of the binary
number system. The method is quite ancient; it appeared before A.D. 400 in
Pingala’s Hindu classic Chandahsastra [see B. Datta and A. N. Singh, History
of Hindu Mathematics 2 (Lahore: Motilal Banarsi Das, 1935), 76]. There seem
to be no other references to this method outside of India during the next several
centuries, but a clear discussion of how to compute $2^n$ efficiently for arbitrary
$n$ was given by al-Uqlīdisī of Damascus in A.D. 952; see The Arithmetic of
al-Uqlīdisī by A. S. Saidan (Dordrecht: D. Reidel, 1975), 341-342, where the
general ideas are illustrated for $n = 51$. See also al-Biruni’s Chronology of
Ancient Nations, edited and translated by E. Sachau (London: 1879), 132-136;
this eleventh-century Arabic work had great influence.

The S-and-X binary method for obtaining $x^n$ requires no temporary storage
except for $x$ and the current partial result, so it is well suited for incorporation in
the hardware of a binary computer. The method can also be readily programmed;
but it requires that the binary representation of $n$ be scanned from left to
right. Computer programs generally prefer to go the other way, because the
available operations of division by 2 and remainder mod 2 will deduce the binary
representation from right to left. Therefore the following algorithm, based on a
right-to-left scan of the number, is often more convenient:

[Flowchart: A1. Initialize; A2. Halve N (odd: go to A3, even: go to A5); A3. Multiply Y by Z; A4. N = 0? (yes: stop, no: go to A5); A5. Square Z, then back to A2.]

Fig. 13. Evaluation of $x^n$, based on a right-to-left scan of the binary notation for $n$.


Algorithm A (Right-to-left binary method for exponentiation). This algorithm
evaluates $x^n$, where $n$ is a positive integer. (Here $x$ belongs to any algebraic
system in which an associative multiplication, with identity element 1, has been
defined.)

A1. [Initialize.] Set $N \leftarrow n$, $Y \leftarrow 1$, $Z \leftarrow x$.

A2. [Halve N.] (At this point, $x^n = Y Z^N$.) Set $t \leftarrow N \bmod 2$ and $N \leftarrow \lfloor N/2 \rfloor$.
If $t = 0$, skip to step A5.

A3. [Multiply Y by Z.] Set $Y \leftarrow Z$ times $Y$.

A4. [N = 0?] If $N = 0$, the algorithm terminates, with $Y$ as the answer.

A5. [Square Z.] Set $Z \leftarrow Z$ times $Z$, and return to step A2. ∎


As an example of Algorithm A, consider the steps in the evaluation of $x^{23}$:

                    N     Y           Z
After step A1      23     $1$         $x$
After step A5      11     $x$         $x^2$
After step A5       5     $x^3$       $x^4$
After step A5       2     $x^7$       $x^8$
After step A5       1     $x^7$       $x^{16}$
After step A4       0     $x^{23}$    $x^{16}$


A MIX program corresponding to Algorithm A appears in exercise 2. 
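
For readers who prefer a high-level language, the steps of Algorithm A can also be sketched in Python. This is an illustrative transcription, not the MIX program of exercise 2; the function name and the optional `one` parameter (the identity element of whatever system `x` belongs to) are assumptions of this sketch.

```python
def power_right_to_left(x, n, one=1):
    """Algorithm A: evaluate x**n for a positive integer n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    N, Y, Z = n, one, x          # A1. Initialize.
    while True:
        t, N = N % 2, N // 2     # A2. Halve N (here x**n == Y * Z**N held before halving).
        if t == 1:
            Y = Z * Y            # A3. Multiply Y by Z.
            if N == 0:           # A4. N = 0?
                return Y
        Z = Z * Z                # A5. Square Z, then return to A2.
```

With `x = 3` and `n = 23` the variables pass through the same states as in the table above, ending with `Y` equal to `3**23`.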


The great calculator al-Kashi stated Algorithm A in A.D. 1427 [Istoriko-Mat.
Issledovaniia 7 (1954), 256-257]. The method is closely related to a procedure
for multiplication that was actually used by Egyptian mathematicians as early as
2000 B.C.; for if we change step A3 to “$Y \leftarrow Y + Z$” and step A5 to “$Z \leftarrow Z + Z$”,
and if we set $Y$ to zero instead of unity in step A1, the algorithm terminates
with $Y = nx$. [See A. B. Chace, The Rhind Mathematical Papyrus (1927);
W. W. Struve, Quellen und Studien zur Geschichte der Mathematik A1 (1930).]
This is a practical method for multiplication by hand, since it involves only
the simple operations of doubling, halving, and adding. It is often called the
“Russian peasant method” of multiplication, since Western visitors to Russia in
the nineteenth century found the method in wide use there.
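
The additive variant just described is easy to try out. The following Python sketch makes exactly the three changes mentioned (addition in step A3, doubling in step A5, and zero as the initial value of Y); the function name `peasant_multiply` is chosen for this illustration only.

```python
def peasant_multiply(n, x):
    """Compute n*x using only doubling, halving, and adding (n a positive integer)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    N, Y, Z = n, 0, x            # A1, with Y set to zero instead of unity
    while True:
        t, N = N % 2, N // 2     # A2. Halve N.
        if t == 1:
            Y = Y + Z            # A3 becomes Y <- Y + Z
            if N == 0:           # A4. N = 0?
                return Y
        Z = Z + Z                # A5 becomes Z <- Z + Z
```

For example, `peasant_multiply(23, 7)` adds 7, 14, 28, and 112 to obtain 161.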

The number of multiplications required by Algorithm A is

$$\lfloor \lg n \rfloor + \nu(n),$$

where $\nu(n)$ is the number of ones in the binary representation of $n$. This is
one more multiplication than the left-to-right binary method mentioned at the
beginning of this section would require, due to the fact that the first execution
of step A3 is simply a multiplication by unity.
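
The count can be checked mechanically. The sketch below tallies the multiplications of Algorithm A (one per execution of step A3 or A5) and of the left-to-right S-and-X rule (one per letter of the rule); all function names are hypothetical helpers introduced for this check.

```python
def count_right_to_left(n):
    """Multiplications performed by Algorithm A for exponent n >= 1."""
    N, count = n, 0
    while True:
        t, N = N % 2, N // 2
        if t == 1:
            count += 1           # step A3
            if N == 0:
                return count
        count += 1               # step A5


def count_left_to_right(n):
    """Length of the S/X rule: one S per bit after the first, one X per 1 bit after the first."""
    bits = bin(n)[2:]
    return (len(bits) - 1) + (bits.count("1") - 1)


def floor_lg(n):
    return n.bit_length() - 1    # floor(lg n) for n >= 1


def nu(n):
    return bin(n).count("1")     # number of ones in binary
```

For $n = 23$ the two counts are 8 and 7, in agreement with $\lfloor \lg 23 \rfloor + \nu(23) = 4 + 4$.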

Because of the bookkeeping time required by this algorithm, the binary
method is usually not of importance for small values of $n$, say $n < 10$, unless the
time for a multiplication is comparatively large. If the value of $n$ is known in
advance, the left-to-right binary method is preferable. In some situations, such
as the calculation of $x^n \bmod u(x)$ discussed in Section 4.6.2, it is much easier
to multiply by $x$ than to perform a general multiplication or to square a value,
so binary methods for exponentiation are primarily suited for quite large $n$ in
such cases. If we wish to calculate the exact multiple-precision value of $x^n$,
when $x$ is an integer greater than the computer word size, binary methods are
not much help unless $n$ is so huge that the high-speed multiplication routines
of Section 4.3.3 are involved; and such applications are rare. Similarly, binary
methods are usually inappropriate for raising a polynomial to a power; see R. J.
Fateman, SICOMP 3 (1974), 196-213, for a discussion of the extensive literature
on polynomial exponentiation.

The point of these remarks is that binary methods are nice, but not a
panacea. They are most applicable when the time to multiply $x^j \cdot x^k$ is essentially
independent of $j$ and $k$ (for example, when we are doing floating point
multiplication, or multiplication mod $m$); in such cases the running time is reduced
from order $n$ to order $\log n$.


Fewer multiplications. Several authors have published statements (without
proof) that the binary method actually gives the minimum possible number of
multiplications. But that is not true. The smallest counterexample is $n = 15$,
when the binary method needs six multiplications, yet we can calculate $y = x^3$
in two multiplications and $x^{15} = y^5$ in three more, achieving the desired result
with only five multiplications. Let us now discuss some other procedures for
evaluating $x^n$, assuming that $n$ is known in advance. Such procedures are of
interest, for example, when an optimizing compiler is generating machine code.

The factor method is based on a factorization of $n$. If $n = pq$, where $p$ is the
smallest prime factor of $n$ and $q > 1$, we may calculate $x^n$ by first calculating $x^p$
and then raising this quantity to the $q$th power. If $n$ is prime, we may calculate
$x^{n-1}$ and multiply by $x$. And, of course, if $n = 1$, we have $x^n$ with no calculation
at all. Repeated application of these rules gives a procedure for evaluating $x^n$,
given any value of $n$. For example, if we want to calculate $x^{55}$, we first evaluate
$y = x^5 = x^4 x = (x^2)^2 x$; then we form $y^{11} = y^{10} y = (y^2)^5 y$. The whole process
takes eight multiplications, while the binary method would have required nine.
The factor method is better than the binary method on the average, but there
are cases ($n = 33$ is the smallest example) where the binary method excels.
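
The multiplication counts quoted for the factor method follow directly from its three rules. Here is a small Python sketch that counts multiplications for both methods; the helper names are invented for this illustration, and trial division is used only because the exponents of interest are small.

```python
def smallest_prime_factor(n):
    """Smallest prime factor of n >= 2, by trial division."""
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p
        p += 1
    return n


def factor_method_cost(n):
    """Number of multiplications used by the factor method to form x**n."""
    if n == 1:
        return 0                                   # x**1 needs no calculation
    p = smallest_prime_factor(n)
    if p == n:                                     # n prime: x**(n-1) times x
        return factor_method_cost(n - 1) + 1
    return factor_method_cost(p) + factor_method_cost(n // p)


def binary_method_cost(n):
    """Multiplications used by the left-to-right binary method."""
    return (n.bit_length() - 1) + (bin(n).count("1") - 1)
```

Under these definitions the costs for $n = 55$ are 8 and 9, and for $n = 33$ they are 7 and 6.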

Fig. 14. The “power tree.”


The binary method can be generalized to an $m$-ary method as follows: Let
$n = d_0 m^t + d_1 m^{t-1} + \cdots + d_t$, where $0 \le d_j < m$ for $0 \le j \le t$. The computation
begins by forming $x$, $x^2$, $x^3$, ..., $x^{m-1}$. (Actually, only those powers $x^{d_j}$ such
that $d_j$ appears in the representation of $n$ are needed, and this observation often
saves some of the work.) Then raise $x^{d_0}$ to the $m$th power and multiply by $x^{d_1}$;
we have computed $y_1 = x^{d_0 m + d_1}$. Next, raise $y_1$ to the $m$th power and multiply
by $x^{d_2}$, obtaining $y_2 = x^{d_0 m^2 + d_1 m + d_2}$. The process continues in this way until
$y_t = x^n$ has been computed. Whenever $d_j = 0$, it is of course unnecessary to
multiply by $x^{d_j}$. Notice that this method reduces to the left-to-right binary
method discussed earlier, when $m = 2$; there is also a less obvious right-to-left
$m$-ary method that takes more memory but only a few more steps (see
exercise 9). If $m$ is a small prime, the $m$-ary method will be particularly efficient
for calculating powers of one polynomial modulo another, when the coefficients
are treated modulo $m$, because of Eq. 4.6.2-(5).
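
A minimal Python sketch of the left-to-right $m$-ary method follows. To keep "raise to the $m$th power" concrete it assumes $m = 2^k$, so that the $m$th power is obtained by $k$ squarings; that restriction, the function name, and the choice to precompute the whole table $x, x^2, \ldots, x^{m-1}$ are simplifications made for this illustration.

```python
def power_m_ary(x, n, k):
    """Evaluate x**n by the m-ary method with m = 2**k (n >= 1, k >= 1)."""
    if n < 1 or k < 1:
        raise ValueError("n and k must be positive integers")
    m = 1 << k
    digits = []                      # d_0, d_1, ..., d_t (most significant first)
    q = n
    while q:
        digits.append(q % m)
        q //= m
    digits.reverse()
    table = {1: x}                   # x, x^2, ..., x^(m-1)
    for d in range(2, m):
        table[d] = table[d - 1] * x
    y = table[digits[0]]             # start with x^(d_0); d_0 is nonzero
    for d in digits[1:]:
        for _ in range(k):           # raise y to the m-th power by k squarings
            y = y * y
        if d:                        # no multiplication needed when d_j = 0
            y = y * table[d]
    return y
```

With `k = 1` this reduces to the left-to-right binary method; with `k = 2` and `n = 23` (digits 1, 1, 3 in radix 4) it forms $x$, $x^4 \cdot x = x^5$, and $x^{20} \cdot x^3 = x^{23}$.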

A systematic method that gives the minimum number of multiplications for
all of the relatively small values of $n$ (in particular, for most $n$ that occur in
practical applications) is indicated in Fig. 14. To calculate $x^n$, find $n$ in this
tree; then the path from the root to $n$ indicates a sequence of exponents that
occur in an efficient evaluation of $x^n$. The rule for generating this “power tree”
appears in exercise 5. Computer tests have shown that the power tree gives
optimum results for all of the $n$ listed in the figure. But for large enough values
of $n$ the power tree method is not always optimum; the smallest examples are
$n = 77$, 154, 233. The first case for which the power tree is superior to both the
binary method and the factor method is $n = 23$. The first case for which the
factor method beats the power tree method is $n = 19879 = 103 \cdot 193$; such cases
are quite rare. (For $n \le 100{,}000$ the power tree method is better than the factor
method 88,803 times; it ties 11,191 times; and it loses only 6 times.)
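
The text defers the generating rule to exercise 5. The Python sketch below assumes the rule as it is commonly stated: to form level $k+1$, take each node $n$ of level $k$ from left to right and attach below it the nodes $n + 1$, $n + a_1$, ..., $n + a_{k-1} = 2n$ (in that order), where $1, a_1, \ldots, a_{k-1} = n$ is the path from the root to $n$, discarding any number already present in the tree. That rule and the function name are assumptions of this illustration and should be checked against the exercise.

```python
def power_tree_chain(n):
    """Path 1, ..., n from the root of the power tree to n (assumed generating rule)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    parent = {1: None}
    level = [1]
    while n not in parent:
        next_level = []
        for node in level:                    # left to right across the level
            path = []                         # root-to-node path 1, a_1, ..., node
            v = node
            while v is not None:
                path.append(v)
                v = parent[v]
            path.reverse()
            for a in path:                    # children node+1, node+a_1, ..., 2*node
                child = node + a
                if child not in parent:       # discard duplicates
                    parent[child] = node
                    next_level.append(child)
        level = next_level
    chain = []
    v = n
    while v is not None:
        chain.append(v)
        v = parent[v]
    return chain[::-1]
```

Under this rule the path to 23 is 1, 2, 3, 5, 10, 13, 23, which uses six multiplications where the binary and factor methods each use seven.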

Fig. 15. A tree that minimizes the number of multiplications, for $n \le 100$.


Addition chains. The most economical way to compute $x^n$ by multiplication
is a mathematical problem with an interesting history. We shall now examine
it in detail, not only because it is classical and interesting in its own right, but
because it is an excellent example of the theoretical questions that arise in the
study of optimum methods of computation.

Although we are concerned with multiplication of powers of $x$, the problem
can easily be reduced to addition, since the exponents are additive. This leads
us to the following abstract formulation: An addition chain for $n$ is a sequence
of integers

$$1 = a_0,\ a_1,\ a_2,\ \ldots,\ a_r = n \tag{1}$$

with the property that

$$a_i = a_j + a_k, \qquad \text{for some } k \le j < i, \tag{2}$$

for all $i = 1, 2, \ldots, r$. One way of looking at this definition is to consider a
simple computer that has an accumulator and is capable of the three operations
LDA, STA, and ADD; the machine begins with the number 1 in its accumulator,
and it proceeds to compute the number $n$ by adding together previous results.
Notice that $a_1$ must equal 2, and $a_2$ is either 2, 3, or 4.
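
Conditions (1) and (2) translate directly into a checker. The following Python sketch tests whether a given list of integers is an addition chain for $n$; the function name is chosen for this illustration.

```python
def is_addition_chain(chain, n):
    """True if chain = [a_0, ..., a_r] satisfies (1) and (2) for the target n."""
    if not chain or chain[0] != 1 or chain[-1] != n:
        return False                          # condition (1): a_0 = 1 and a_r = n
    for i in range(1, len(chain)):
        earlier = chain[:i]
        # condition (2): a_i = a_j + a_k for some k <= j < i
        if not any((chain[i] - a) in earlier for a in earlier):
            return False
    return True
```

For example, `[1, 2, 3, 6, 12, 15]` is accepted as a chain for 15, while `[1, 2, 5]` is rejected because 5 is not the sum of two earlier terms.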

The shortest length, $r$, for which there exists an addition chain for $n$ is
denoted by $l(n)$. Thus $l(1) = 0$, $l(2) = 1$, $l(3) = l(4) = 2$, etc. Our goal in the
remainder of this section is to discover as much as we can about this function
$l(n)$. The values of $l(n)$ for small $n$ are displayed in tree form in Fig. 15, which
shows how to calculate $x^n$ with the fewest possible multiplications for all $n \le 100$.
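
For small $n$ the value of $l(n)$ can be found by exhaustive search. The Python sketch below uses depth-first search with an increasing length limit; it is a brute-force illustration, not an efficient method. It assumes that only strictly increasing chains need to be examined (a restriction that loses no generality) and it prunes with the simple observation that each step can at most double the last term. The function name is hypothetical.

```python
def shortest_chain_length(n):
    """Return l(n), the length of a shortest addition chain for n, by brute force."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if n == 1:
        return 0

    def search(chain, limit):
        last = chain[-1]
        if last == n:
            return True
        steps_left = limit - (len(chain) - 1)
        # Each remaining step can at most double the last term.
        if steps_left == 0 or (last << steps_left) < n:
            return False
        sums = {a + b for a in chain for b in chain if last < a + b <= n}
        for s in sorted(sums, reverse=True):  # try larger terms first
            chain.append(s)
            if search(chain, limit):
                return True
            chain.pop()
        return False

    limit = 1
    while not search([1], limit):
        limit += 1
    return limit
```

This reproduces $l(3) = l(4) = 2$ and gives $l(15) = 5$ and $l(23) = 6$.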

The problem of determining $l(n)$ was apparently first raised by H. Dellac in
1894, and a partial solution by E. de Jonquières mentioned the factor method
[see L’Intermédiaire des Mathématiciens 1 (1894), 20, 162-164]. In his solution,
de Jonquières listed what he felt were the values of $l(p)$ for all prime numbers
$p < 200$, but his table entries for $p = 107$, 149, 163, 179, 199 were one too high.

The factor method tells us immediately that

$$l(mn) \le l(m) + l(n), \tag{3}$$

since we can take the chains $1, a_1, \ldots, a_r = m$ and $1, b_1, \ldots, b_s = n$ and form
the chain $1, a_1, \ldots, a_r, a_r b_1, \ldots, a_r b_s = mn$.


We can also recast the $m$-ary method into addition-chain terminology.
Consider the case $m = 2^k$, and write $n = d_0 m^t + d_1 m^{t-1} + \cdots + d_t$ in the $m$-ary
number system; the corresponding addition chain takes the form

$$\begin{gathered}
1,\ 2,\ 3,\ \ldots,\ m-2,\ m-1, \\
2d_0,\ 4d_0,\ \ldots,\ md_0,\ md_0 + d_1, \\
2(md_0 + d_1),\ 4(md_0 + d_1),\ \ldots,\ m(md_0 + d_1),\ m^2 d_0 + m d_1 + d_2, \\
\ldots,\quad m^t d_0 + m^{t-1} d_1 + \cdots + d_t.
\end{gathered} \tag{4}$$

The length of this chain is $m - 2 + (k+1)t$; and it can often be reduced by deleting
certain elements of the first row that do not occur among the coefficients $d_j$, plus
elements among $2d_0$, $4d_0$, ... that already appear in the first row. Whenever
digit $d_j$ is zero, the step at the right end of the corresponding line may, of course,
be dropped. Furthermore, we can omit all the even numbers (except 2) in the
first row, if we bring values of the form $d_j/2^e$ into the computation $e$ steps earlier.
[See E. Wattel and G. A. Jensen, Math. Centrum Report ZW1968-001 (1968),
18 pp.; E. G. Thurber, Duke Math. J. 40 (1973), 907-913.]
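
Chain (4) can be generated mechanically. The Python sketch below builds it for $m = 2^k$, drops the step at the right end of a line when the digit is zero, and removes repeated elements; it does not attempt the further reductions mentioned above (deleting unused first-row elements or omitting even numbers). The function name is an assumption of this illustration.

```python
def m_ary_chain(n, k):
    """Addition chain (4) for n with m = 2**k, duplicates removed, in ascending order."""
    if n < 1 or k < 1:
        raise ValueError("n and k must be positive integers")
    m = 1 << k
    digits = []                        # d_0, d_1, ..., d_t
    q = n
    while q:
        digits.append(q % m)
        q //= m
    digits.reverse()
    chain = list(range(1, m))          # first row: 1, 2, ..., m-1
    value = digits[0]
    for d in digits[1:]:
        for _ in range(k):             # 2v, 4v, ..., m*v
            value *= 2
            chain.append(value)
        if d:                          # m*v + d_j, dropped when d_j = 0
            value += d
            chain.append(value)
    # Keep each element once; when n < m the first row is cut off at n.
    return sorted(a for a in set(chain) if a <= n)
```

The number of steps, `len(m_ary_chain(n, k)) - 1`, never exceeds $m - 2 + (k+1)t$, where $t + 1$ is the number of radix-$m$ digits of $n$.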

The simplest case of the $m$-ary method is the binary method ($m = 2$),
when the general scheme (4) simplifies to the “S” and “X” rule mentioned at the
beginning of this section: The binary addition chain for $2n$ is the binary chain
for $n$ followed by $2n$; for $2n + 1$ it is the binary chain for $2n$ followed by $2n + 1$.
From the binary method we conclude that

$$l(2^{e_0} + 2^{e_1} + \cdots + 2^{e_t}) \le e_0 + t, \qquad \text{if } e_0 > e_1 > \cdots > e_t \ge 0. \tag{5}$$


Let us now define two auxiliary functions for convenience in our subsequent
discussion:

$$\lambda(n) = \lfloor \lg n \rfloor; \tag{6}$$
$$\nu(n) = \text{number of 1s in the binary representation of } n. \tag{7}$$

Thus $\lambda(17) = 4$, $\nu(17) = 2$; these functions may be defined by the recurrence
relations

$$\lambda(1) = 0, \qquad \lambda(2n) = \lambda(2n+1) = \lambda(n) + 1; \tag{8}$$
$$\nu(1) = 1, \qquad \nu(2n) = \nu(n), \qquad \nu(2n+1) = \nu(n) + 1. \tag{9}$$

In terms of these functions, the binary addition chain for $n$ requires exactly
$\lambda(n) + \nu(n) - 1$ steps, and (5) becomes

$$l(n) \le \lambda(n) + \nu(n) - 1. \tag{10}$$
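
The recurrences (8) and (9) and the recursive description of the binary addition chain can be written down almost verbatim. In this Python sketch the names `lam`, `nu`, and `binary_chain` are chosen for the illustration.

```python
def lam(n):
    """lambda(n) = floor(lg n), via recurrence (8)."""
    return 0 if n == 1 else lam(n // 2) + 1


def nu(n):
    """nu(n) = number of 1s in the binary representation of n, via recurrence (9)."""
    return 1 if n == 1 else nu(n // 2) + (n % 2)


def binary_chain(n):
    """Binary addition chain: the chain for 2n is the chain for n followed by 2n;
    the chain for 2n+1 is the chain for 2n followed by 2n+1."""
    if n == 1:
        return [1]
    if n % 2 == 0:
        return binary_chain(n // 2) + [n]
    return binary_chain(n - 1) + [n]
```

Here `binary_chain(23)` is `[1, 2, 4, 5, 10, 11, 22, 23]`, with $\lambda(23) + \nu(23) - 1 = 4 + 4 - 1 = 7$ steps.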

Special classes of chains. We may assume without any loss of generality that
an addition chain is ascending,

$$1 = a_0 < a_1 < a_2 < \cdots < a_r = n. \tag{11}$$

For if any two $a$’s are equal, one of them may be dropped; and we can also
rearrange the sequence (1) into ascending order and remove terms $> n$ without
destroying the addition chain property (2). From now on we shall consider only
ascending chains, without explicitly mentioning this assumption.

It is convenient at this point to define a few special terms relating to addition
chains. By definition we have, for $1 \le i \le r$,

$$a_i = a_j + a_k \tag{12}$$

for some $j$ and $k$, $0 \le k \le j < i$. If this relation holds for more than one
pair $(j, k)$, we let $j$ be as large as possible. Let us say that step $i$ of (11) is
a doubling, if $j = k = i - 1$; then $a_i$ has the maximum possible value $2a_{i-1}$
that can follow the ascending chain $1, a_1, \ldots, a_{i-1}$. If $j$ (but not necessarily $k$)
equals $i - 1$, let us say that step $i$ is a star step. The importance of star steps is
explained below. Finally let us say that step $i$ is a small step if $\lambda(a_i) = \lambda(a_{i-1})$.
Since $a_{i-1} < a_i \le 2a_{i-1}$, the quantity $\lambda(a_i)$ is always equal to either $\lambda(a_{i-1})$ or
$\lambda(a_{i-1}) + 1$; it follows that, in any chain (11), the length $r$ is equal to $\lambda(n)$ plus
the number of small steps.

Several elementary relations hold between these types of steps: Step 1 is
always a doubling. A doubling obviously is a star step, but never a small step.
A doubling must be followed by a star step. Furthermore if step $i$ is not a small
step, then step $i + 1$ is either a small step or a star step, or both; putting this
another way, if step $i + 1$ is neither small nor star, step $i$ must have been small.
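
These definitions are easy to experiment with. The Python sketch below labels each step of an ascending chain as a doubling, a star step, and/or a small step; the function name and the representation of the labels as sets of strings are choices made for this illustration.

```python
def classify_steps(chain):
    """For an ascending chain a_0, ..., a_r return a list whose entry i-1 is the
    set of labels ('doubling', 'star', 'small') that apply to step i."""
    def lam(a):
        return a.bit_length() - 1             # lambda(a) = floor(lg a)

    labels = []
    for i in range(1, len(chain)):
        prev, cur = chain[i - 1], chain[i]
        kinds = set()
        if cur == 2 * prev:
            kinds.add("doubling")             # j = k = i-1
        if (cur - prev) in chain[:i]:
            kinds.add("star")                 # j = i-1 for some earlier a_k
        if lam(cur) == lam(prev):
            kinds.add("small")                # lambda(a_i) = lambda(a_{i-1})
        labels.append(kinds)
    return labels
```

In the chain 1, 2, 4, 5, 8, 10 the step that produces 8 is neither small nor star (8 = 4 + 4), and the step before it is indeed small; the chain has two small steps, and its length 5 equals $\lambda(10) + 2$.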

A star chain is an addition chain that involves only star steps. This means
that each term $a_i$ is the sum of $a_{i-1}$ and a previous $a_k$; the simple “computer”
discussed above after Eq. (2) makes use only of the two operations STA and
ADD (not LDA) in a star chain, since each new term of the sequence utilizes
the preceding result in the accumulator. Most of the addition chains we have
discussed so far are star chains. The minimum length of a star chain for $n$ is
denoted by $l^*(n)$; clearly

$$l(n) \le l^*(n). \tag{13}$$

We are now ready to derive some nontrivial facts about addition chains. First
we can show that there must be fairly many doublings if $r$ is not far from $\lambda(n)$.

Theorem A. If the addition chain (11) includes $d$ doublings and $f = r - d$
nondoublings, then

$$n \le 2^{d-1} F_{f+3}. \tag{14}$$

Proof. By induction on $r = d + f$, we see that (14) is certainly true when
$r = 1$. When $r > 1$, there are three cases: If step $r$ is a doubling, then
$\frac{1}{2}n = a_{r-1} \le 2^{d-2} F_{f+3}$; hence (14) follows. If steps $r$ and $r - 1$ are both
nondoublings, then $a_{r-1} \le 2^{d-1} F_{f+2}$ and $a_{r-2} \le 2^{d-1} F_{f+1}$; hence $n = a_r \le
a_{r-1} + a_{r-2} \le 2^{d-1}(F_{f+2} + F_{f+1}) = 2^{d-1} F_{f+3}$ by the definition of the Fibonacci
sequence. Finally, if step $r$ is a nondoubling but step $r - 1$ is a doubling, then
$a_{r-2} \le 2^{d-2} F_{f+2}$ and $n = a_r \le a_{r-1} + a_{r-2} = 3a_{r-2}$. Now $2F_{f+3} - 3F_{f+2} =
F_{f+1} - F_f \ge 0$; hence $n \le 2^{d-1} F_{f+3}$ in all cases. ∎


The method of proof we have used shows that inequality (14) is “best
possible” under the stated assumptions; the addition chain

$$1,\ 2,\ \ldots,\ 2^{d-1},\ 2^{d-1}F_3,\ 2^{d-1}F_4,\ \ldots,\ 2^{d-1}F_{f+3} \tag{15}$$

has $d$ doublings and $f$ nondoublings.
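
Both the bound of Theorem A and the extremal chain can be checked numerically. The Python sketch below is an illustration with invented helper names; it uses the convention $F_0 = 0$, $F_1 = 1$ and assumes $d \ge 1$ when building the extremal chain, since step 1 of any chain is a doubling.

```python
def fib(k):
    """Fibonacci number F_k with F_0 = 0, F_1 = 1."""
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a


def count_doublings(chain):
    return sum(1 for i in range(1, len(chain)) if chain[i] == 2 * chain[i - 1])


def satisfies_theorem_a(chain):
    """Check n <= 2^(d-1) F_(f+3), written as 2n <= 2^d F_(f+3) to stay in integers."""
    d = count_doublings(chain)
    f = (len(chain) - 1) - d
    return 2 * chain[-1] <= (2 ** d) * fib(f + 3)


def extremal_chain(d, f):
    """The chain 1, 2, ..., 2^(d-1), 2^(d-1) F_3, ..., 2^(d-1) F_(f+3), for d >= 1."""
    chain = [2 ** i for i in range(d)]
    chain += [2 ** (d - 1) * fib(k) for k in range(3, f + 4)]
    return chain
```

For example, `extremal_chain(2, 2)` is `[1, 2, 4, 6, 10]`, which has two doublings, two nondoublings, and last term $2^1 F_5 = 10$.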


Corollary A. If the addition chain (11) includes $f$ nondoublings and $s$ small
steps, then

$$s \le f \le 3.271\,s. \tag{16}$$

Proof. Obviously $s \le f$. We have $2^{\lambda(n)} \le n \le 2^{d-1} F_{f+3} \le 2^d \phi^f = 2^{\lambda(n)+s} (\phi/2)^f$,
since $d + f = \lambda(n) + s$, and since $F_{f+3} \le 2\phi^f$ when $f \ge 0$. Hence $0 \le s \ln 2 +
f \ln(\phi/2)$, and (16) follows from the fact that $\ln 2 / \ln(2/\phi) \approx 3.2706$. ∎


Values of $l(n)$ for special $n$. It is easy to show by induction that $a_i \le 2^i$,
and therefore $\lg n \le r$ in any addition chain (11). Hence

$$l(n) \ge \lceil \lg n \rceil. \tag{17}$$


This lower bound, together with the upper bound (10) given by the binary
method, gives us the values

$$l(2^A) = A; \tag{18}$$
$$l(2^A + 2^B) = A + 1, \qquad \text{if } A > B. \tag{19}$$

In other words, the binary method is optimum when $\nu(n) \le 2$. With some
further calculation we can extend these formulas to the case $\nu(n) = 3$:

Theorem B. $$l(2^A + 2^B + 2^C) = A + 2, \qquad \text{if } A > B > C. \tag{20}$$


Proof. We can, in fact, prove a stronger result that will be of use to us later
in this section: All addition chains with exactly one small step have one of the
following six types (where all steps indicated by “...” represent doublings):

Type 1. $1, \ldots, 2^A, 2^A + 2^B, \ldots, 2^{A+C} + 2^{B+C}$; $A > B \ge 0$, $C \ge 0$.

Type 2. $1, \ldots, 2^A, 2^A + 2^B, 2^{A+1} + 2^B, \ldots, 2^{A+C+1} + 2^{B+C}$; $A > B \ge 0$,
$C \ge 0$.

Type 3. $1, \ldots, 2^A, 2^A + 2^{A-1}, 2^{A+1} + 2^{A-1}, 2^{A+2}, \ldots, 2^{A+C}$; $A > 0$, $C \ge 2$.

Type 4. $1, \ldots, 2^A, 2^A + 2^{A-1}, 2^{A+1} + 2^A, 2^{A+2}, \ldots, 2^{A+C}$; $A > 0$, $C \ge 2$.

Type 5. $1, \ldots, 2^A, 2^A + 2^{A-1}, \ldots, 2^{A+C} + 2^{A+C-1}, 2^{A+C+1} + 2^{A+C-2},
\ldots, 2^{A+C+D+1} + 2^{A+C+D-2}$; $A > 0$, $C > 0$, $D \ge 0$.

Type 6. $1, \ldots, 2^A, 2^A + 2^B, 2^{A+1}, \ldots, 2^{A+C}$; $A > B \ge 0$, $C \ge 1$.

A straightforward hand calculation shows that these six types exhaust all
possibilities. By Corollary A, there are at most three nondoublings when there
is one small step; this maximum occurs only in sequences of Type 3. All of the
above are star chains, except Type 6 when $B < A - 1$.

The theorem now follows from the observation that

$$l(2^A + 2^B + 2^C) \le A + 2;$$

and $l(2^A + 2^B + 2^C)$ must be greater than $A + 1$, since none of the six possible
types have $\nu(n) > 2$. ∎


(E. de Jonquières stated without proof in 1894 that $l(n) \ge \lambda(n) + 2$ when
$\nu(n) > 2$. The first published demonstration of Theorem B was by A. A. Gioia,
M. V. Subbarao, and M. Sugunamma in Duke Math. J. 29 (1962), 481-487.)

The calculation of $l(2^A + 2^B + 2^C + 2^D)$, when $A > B > C > D$, is more
involved. By the binary method it is at most $A + 3$, and by the proof of Theorem B
it is at least $A + 2$. The value $A + 2$ is possible, since we know that the binary
method is not optimal when $n = 15$ or $n = 23$. The complete behavior when
$\nu(n) = 4$ can be determined, as we shall now see.


Theorem C. If $\nu(n) \ge 4$ then $l(n) \ge \lambda(n) + 3$, except in the following
circumstances when $A > B > C > D$ and $l(2^A + 2^B + 2^C + 2^D)$ equals $A + 2$:

Case 1. $A - B = C - D$. (Example: $n = 15$.)

Case 2. $A - B = C - D + 1$. (Example: $n = 23$.)

Case 3. $A - B = 3$, $C - D = 1$. (Example: $n = 39$.)

Case 4. $A - B = 5$, $B - C = C - D = 1$. (Example: $n = 135$.)
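
The four exceptional cases can be recognized from the binary representation alone. The Python sketch below classifies $n$ and also records one addition chain of length $\lambda(n) + 2$ for each of the four example values; the chains were worked out by hand for this illustration, and the names `theorem_c_case` and `EXAMPLE_CHAINS` are hypothetical.

```python
def theorem_c_case(n):
    """Return 1, 2, 3, or 4 if n = 2^A + 2^B + 2^C + 2^D falls under that case
    of Theorem C, and None otherwise (including when nu(n) != 4)."""
    exps = [i for i in range(n.bit_length()) if (n >> i) & 1][::-1]
    if len(exps) != 4:
        return None
    A, B, C, D = exps                          # A > B > C > D
    if A - B == C - D:
        return 1
    if A - B == C - D + 1:
        return 2
    if A - B == 3 and C - D == 1:
        return 3
    if A - B == 5 and B - C == 1 and C - D == 1:
        return 4
    return None


# Hand-built chains of length lambda(n) + 2 for the examples in the theorem.
EXAMPLE_CHAINS = {
    15: [1, 2, 3, 6, 12, 15],
    23: [1, 2, 3, 5, 10, 13, 23],
    39: [1, 2, 3, 6, 12, 24, 36, 39],
    135: [1, 2, 3, 6, 9, 15, 30, 45, 90, 135],
}
```

A value such as $n = 29 = 2^4 + 2^3 + 2^2 + 2^0$ falls under none of the cases, so the theorem gives $l(29) \ge \lambda(29) + 3 = 7$.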


Proof. When $l(n) = \lambda(n) + 2$, there is an addition chain for $n$ having just two
small steps; such an addition chain starts out as one of the six types in the proof
of Theorem B, followed by a small step, followed by a sequence of nonsmall
steps. Let us say that $n$ is “special” if $n = 2^A + 2^B + 2^C + 2^D$ for one of the four
cases listed in the theorem. We can obtain addition chains of the required form
for each special $n$, as shown in exercise 13; therefore it remains for us to prove
that no chain with exactly two small steps contains any elements with $\nu(a_i) \ge 4$
except when $a_i$ is special.

Let a “counterexample chain” be an addition chain with two small steps
such that $\nu(a_r) \ge 4$, but $a_r$ is not special. If counterexample chains exist, let
$1 = a_0 < a_1 < \cdots < a_r = n$ be a counterexample chain of shortest possible
length. Then step $r$ is not a small step, since none of the six types in the proof
of Theorem B can be followed by a small step with $\nu(n) \ge 4$ except when $n$ is
special. Furthermore, step $r$ is not a doubling, otherwise $a_0, \ldots, a_{r-1}$ would
be a shorter counterexample chain; and step $r$ is a star step, otherwise $a_0, \ldots,
a_{r-2}, a_r$ would be a shorter counterexample chain. Thus

$$a_r = a_{r-1} + a_{r-k}, \quad k \ge 2; \qquad \text{and} \quad \lambda(a_r) = \lambda(a_{r-1}) + 1. \tag{21}$$

Let $c$ be the number of carries that occur when $a_{r-k}$ is added to $a_{r-1}$ in the
binary number system by Algorithm 4.3.1A. Using the fundamental relation

$$\nu(a_r) = \nu(a_{r-1}) + \nu(a_{r-k}) - c, \tag{22}$$

we can prove that step $r - 1$ is not a small step (see exercise 14).

Let $m = \lambda(a_{r-1})$. Since neither $r$ nor $r - 1$ is a small step, $c \ge 2$; and $c = 2$
can hold only when $a_{r-1} \ge 2^m + 2^{m-1}$.

Now let us suppose that $r - 1$ is not a star step. Then $r - 2$ is a small step,
and $a_0, \ldots, a_{r-3}, a_{r-1}$ is a chain with only one small step; hence $\nu(a_{r-1}) \le 2$
and $\nu(a_{r-2}) \le 4$. The relation (22) can now hold only if $\nu(a_r) = 4$, $\nu(a_{r-1}) = 2$,
$k = 2$, $c = 2$, $\nu(a_{r-2}) = 4$. From $c = 2$ we conclude that $a_{r-1} = 2^m + 2^{m-1}$;
hence $a_0, a_1, \ldots, a_{r-3} = 2^{m-1} + 2^{m-2}$ is an addition chain with only one small
step, and it must be of Type 1, so $a_r$ belongs to Case 3. Thus $r - 1$ is a star step.

Now assume that $a_{r-1} = 2^t a_{r-k}$ for some $t$. If $\nu(a_{r-1}) \le 3$, then by (22),
$c = 2$, $k = 2$, and we see that $a_r$ must belong to Case 3. On the other hand, if
$\nu(a_{r-1}) = 4$ then $a_{r-1}$ is special, and it is easy to see by considering each case
that $a_r$ also belongs to one of the four cases. (Case 4 arises, for example, when
$a_{r-1} = 90$, $a_{r-k} = 45$; or $a_{r-1} = 120$, $a_{r-k} = 15$.) Therefore we may conclude
that $a_{r-1} \ne 2^t a_{r-k}$, for any $t$.

We have proved that $a_{r-1} = a_{r-2} + a_{r-q}$ for some $q \ge 2$. If $k = 2$, then
$q > 2$, and $a_0, a_1, \ldots, a_{r-2}, 2a_{r-2}, 2a_{r-2} + a_{r-q} = a_r$ is a counterexample
sequence in which $k > 2$; therefore we may assume that $k > 2$.

Let us now suppose that $\lambda(a_{r-k}) = m - 1$; the case $\lambda(a_{r-k}) < m - 1$ may
be ruled out by similar arguments, as shown in exercise 14. If $k = 4$, both
$r - 2$ and $r - 3$ are small steps; hence $a_{r-4} = 2^{m-1}$, and (22) is impossible.
Therefore $k = 3$; step $r - 2$ is small, $\nu(a_{r-3}) = 2$, $c = 2$, $a_{r-1} \ge 2^m + 2^{m-1}$,
and $\nu(a_{r-1}) = 4$. There must be at least two carries when $a_{r-2}$ is added to
$a_{r-1} - a_{r-2}$; hence $\nu(a_{r-2}) = 4$, and $a_{r-2}$ (being special and $\ge \frac{1}{2}a_{r-1}$) has the
form $2^{m-1} + 2^{m-2} + 2^{d+1} + 2^d$ for some $d$. Now $a_{r-1}$ is either $2^m + 2^{m-1} + 2^{d+1} + 2^d$
or $2^m + 2^{m-1} + 2^{d+2} + 2^{d+1}$, and in both cases $a_{r-3}$ must be $2^{m-1} + 2^{m-2}$, so
$a_r$ belongs to Case 3. ∎


E. G. Thurber [Pacific J. Math. 49 (1973), 229-242] has extended Theorem C
to show that $l(n) \ge \lambda(n) + 4$ when $\nu(n) > 8$. It seems reasonable to conjecture
that $l(n) \ge \lambda(n) + \lg \nu(n)$ in general, since A. Schönhage has come very close to
proving this (see exercise 28).


*Asymptotic values. Theorem C indicates that it is probably quite difficult to
get exact values of $l(n)$ for large $n$, when $\nu(n) > 4$; however, we can determine
the approximate behavior in the limit as $n \to \infty$.


Theorem D. [A. Brauer, Bull. Amer. Math. Soc. 45 (1939), 736-739.]

$$\lim_{n\to\infty} l^*(n)/\lambda(n) = \lim_{n\to\infty} l(n)/\lambda(n) = 1. \tag{23}$$


Proof. The addition chain (4) for the $2^k$-ary method is a star chain if we delete
the second occurrence of any element that appears twice in the chain; for if $a_i$
is the first element among $2d_0$, $4d_0$, ... of the second line that is not present in
the first line, we have $a_i \le 2(m - 1)$; hence $a_i = (m - 1) + a_j$ for some $a_j$ in the
first line. By totaling up the length of the chain, we have

$$\lambda(n) \le l(n) \le l^*(n) < (1 + 1/k) \lg n + 2^k \tag{24}$$

for all $k \ge 1$. The theorem follows if we choose, say, $k = \lfloor \frac{1}{2} \lg \lambda(n) \rfloor$. ∎
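
The totaling step can be spelled out as follows. This is a reconstruction of the routine arithmetic behind (24), using only the length $m - 2 + (k+1)t$ of chain (4) with $m = 2^k$ and the fact that $m^t \le n$ implies $t \le (\lg n)/k$; it is offered as a worked sketch rather than as the author's own derivation.

```latex
\begin{align*}
l^*(n) &\le m - 2 + (k+1)\,t
        \;\le\; 2^k - 2 + (k+1)\,\frac{\lg n}{k}
        \;<\; \Bigl(1 + \frac{1}{k}\Bigr)\lg n + 2^k . \\[4pt]
\intertext{With $k = \lfloor \tfrac12 \lg \lambda(n) \rfloor$ we have
           $2^k \le \sqrt{\lambda(n)}$ and $\lg n < \lambda(n) + 1$, so}
1 \;\le\; \frac{l(n)}{\lambda(n)} \;\le\; \frac{l^*(n)}{\lambda(n)}
       &< \Bigl(1 + \frac{1}{k}\Bigr)\frac{\lambda(n) + 1}{\lambda(n)}
          + \frac{1}{\sqrt{\lambda(n)}}
       \;\longrightarrow\; 1 \qquad (n \to \infty),
\end{align*}
```

since $k \to \infty$ as $n \to \infty$.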


If we let $k = \lambda\lambda(n) - 2\lambda\lambda\lambda(n)$ in (24) for large $n$, where $\lambda\lambda(n)$ denotes
$\lambda(\lambda(n))$, we obtain the stronger asymptotic bound

$$l(n) \le l^*(n) \le \lambda(n) + \lambda(n)/\lambda\lambda(n) + O\bigl(\lambda(n)\,\lambda\lambda\lambda(n)/\lambda\lambda(n)^2\bigr). \tag{25}$$

The second term $\lambda(n)/\lambda\lambda(n)$ is essentially the best that can be obtained from
(24). A much deeper analysis of lower bounds can be carried out, to show that
this term $\lambda(n)/\lambda\lambda(n)$ is, in fact, essential in (25). In order to see why this is so,
let us consider the following fact:


Theorem E. [Paul Erdős, Acta Arithmetica 6 (1960), 77-81.] Let $\epsilon$ be a
positive real number. The number of addition chains (11) such that

$$\lambda(n) = m, \qquad r \le m + (1 - \epsilon)\,m/\lambda(m) \tag{26}$$

is less than $\alpha^m$, for some $\alpha < 2$, for all suitably large $m$. (In other words, the
number of addition chains so short that (26) is satisfied is substantially less than
the number of values of $n$ such that $\lambda(n) = m$, when $m$ is large.)


Proof. We want to estimate the number of possible addition chains, and for this 
purpose our first goal is to get an improvement of Theorem A that enables us to 
deal more satisfactorily with nondoublings. 


Lemma P. Let $\delta < \sqrt{2} - 1$ be a fixed positive real number. Call step $i$ of an
addition chain a “ministep” if it is not a doubling and if $a_i < a_j (1 + \delta)^{i-j}$ for
some $j$, where $0 \le j < i$. If the addition chain contains $s$ small steps and $t$
ministeps, then

$$t \le s/(1 - \theta), \qquad \text{where } (1 + \delta)^2 = 2^{\theta}. \tag{27}$$

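Proof. For each ministep $i_k$, $1 \le k \le t$, we have $a_{i_k} < a_{j_k} (1 + \delta)^{i_k - j_k}$ for some
$j_k < i_k$. Let $I_1, \ldots, I_t$ be the intervals $(j_1 \,..\, i_1], \ldots, (j_t \,..\, i_t]$, where the notation
$(j \,..\, i]$ stands for the set of all integers $k$ such that $j < k \le i$. It is possible (see
exercise 17) to find nonoverlapping intervals $J_1, \ldots, J_h = (j'_1 \,..\, i'_1], \ldots, (j'_h \,..\, i'_h]$
such that

$$\begin{gathered}
I_1 \cup \cdots \cup I_t = J_1 \cup \cdots \cup J_h, \\
a_{i'_k} < a_{j'_k} (1 + \delta)^{2(i'_k - j'_k)}, \qquad \text{for } 1 \le k \le h.
\end{gathered} \tag{28}$$

Now for all steps $i$ outside of the intervals $J_1, \ldots, J_h$ we have $a_i \le 2a_{i-1}$; hence
if we let

$$q = (i'_1 - j'_1) + \cdots + (i'_h - j'_h),$$

we have $2^{\lambda(n)} \le n \le 2^{r-q} (1 + \delta)^{2q} = 2^{\lambda(n) + s - (1-\theta)q} \le 2^{\lambda(n) + s - (1-\theta)t}$. ∎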

Returning to the proof of Theorem E, let us choose $\delta = 2^{\epsilon/4} - 1$, and let us
divide the $r$ steps of each addition chain into three classes:

$$t \text{ ministeps}, \quad u \text{ doublings}, \quad v \text{ other steps}, \qquad t + u + v = r. \tag{29}$$

Counting another way, we have $s$ small steps, where $s + m = r$. By the
hypotheses, Theorem A, and Lemma P, we obtain the relations

$$s \le (1 - \epsilon)\,m/\lambda(m), \qquad t + v \le 3.271\,s, \qquad t \le s/(1 - \epsilon/2). \tag{30}$$


Given $s$, $t$, $u$, $v$ satisfying these conditions, there are

$$\binom{r}{t,\,u,\,v} = \binom{r}{t+v}\binom{t+v}{v} \tag{31}$$

ways to assign the steps to the specified classes. Given such a distribution of
the steps, let us consider how the non-ministeps can be selected: If step $i$ is
one of the “other” steps in (29), $a_i \ge (1 + \delta)a_{i-1}$, so $a_i = a_j + a_k$, where
$\delta a_{i-1} \le a_k \le a_j \le a_{i-1}$. Also $a_j \le a_i/(1 + \delta)^{i-j} \le 2a_{i-1}/(1 + \delta)^{i-j}$, so
$\delta \le 2/(1 + \delta)^{i-j}$. This gives at most $\beta$ choices for $j$, where $\beta$ is a constant that
depends only on $\delta$. There are also at most $\beta$ choices for $k$, so the number of
ways to assign $j$ and $k$ for each of the non-ministeps is at most

$$\beta^{2v}. \tag{32}$$

Finally, once the “$j$” and “$k$” have been selected for each of the
non-ministeps, there are fewer than

$$\binom{r^2}{t} \tag{33}$$

ways to choose the $j$ and the $k$ for the ministeps: We select $t$ distinct pairs
$(j_1, k_1), \ldots, (j_t, k_t)$ of indices in the range $0 \le k_h \le j_h < r$, in fewer than (33)
ways. Then for each ministep $i$, in turn, we use a pair of indices $(j_h, k_h)$ such
that

a) $j_h < i$;

b) $a_{j_h} + a_{k_h}$ is as small as possible among the pairs not already used for smaller
ministeps $i$;

c) $a_i = a_{j_h} + a_{k_h}$ satisfies the definition of ministep.

If no such pair $(j_h, k_h)$ exists, we get no addition chain; on the other hand, any
addition chain with ministeps in the designated places must be selected in one
of these ways, so (33) is an upper bound on the possibilities.

Thus the total number of possible addition chains satisfying (26) is bounded
by (31) times (32) times (33), summed over all relevant $s$, $t$, $u$, and $v$. The proof
of Theorem E can now be completed by means of a rather standard estimation
of these functions (exercise 18). ∎


Corollary E. The value of $l(n)$ is asymptotically $\lambda(n) + \lambda(n)/\lambda\lambda(n)$, for almost
all $n$. More precisely, there is a function $f(n)$ such that $f(n) \to 0$ as $n \to \infty$,
and

$$\Pr\bigl(\,|l(n) - \lambda(n) - \lambda(n)/\lambda\lambda(n)| \ge f(n)\,\lambda(n)/\lambda\lambda(n)\bigr) = 0. \tag{34}$$

(See Section 3.5 for the definition of this probability “Pr”.)

Proof. The upper bound (25) shows that (34) holds without the absolute value
signs. The lower bound comes from Theorem E, if we let $f(n)$ decrease to zero
slowly enough so that, when $f(n) \le \epsilon$, the value $N$ is so large that at most $\epsilon N$
values $n \le N$ have $l(n) \le \lambda(n) + (1 - \epsilon)\lambda(n)/\lambda\lambda(n)$. ∎


*Star chains. Optimistic people find it reasonable to suppose that $l(n) = l^*(n)$;
given an addition chain of minimal length $l(n)$, it appears hard to believe that
we cannot find one of the same length that satisfies the (apparently mild) star
condition. But in 1958 Walter Hansen proved the remarkable theorem that, for
certain large values of $n$, the value of $l(n)$ is definitely less than $l^*(n)$, and he
also proved several related theorems that we shall now investigate.

Hansen’s theorems begin with an investigation of the detailed structure of 
a star chain. Let $n = 2^{e_0} + 2^{e_1} + \cdots + 2^{e_t}$, where $e_0 > e_1 > \cdots > e_t \ge 0$, and 
let $1 = a_0 < a_1 < \cdots < a_r = n$ be a star chain for n. If there are d doublings in 
this chain, we define the auxiliary sequence 

    $0 = d_0 \le d_1 \le d_2 \le \cdots \le d_r = d$,    (35)

where $d_i$ is the number of doublings among steps 1, 2, ..., i. We also define 
a sequence of “multisets” $S_0$, $S_1$, ..., $S_r$, which keep track of the powers of 2 
present in the chain. (A multiset is a mathematical entity that is like a set, 
but it is allowed to contain repeated elements; an object may be an element 
of a multiset several times, and its multiplicity of occurrences is relevant. See 
exercise 19 for familiar examples of multisets.) The multisets $S_i$ are defined by 
the rules 

a) $S_0 = \{0\}$; 
b) If $a_{i+1} = 2a_i$, then $S_{i+1} = S_i + 1 = \{x+1 \mid x \in S_i\}$; 
c) If $a_{i+1} = a_i + a_k$, $k < i$, then $S_{i+1} = S_i \uplus S_k$. 

(The symbol $\uplus$ means that the multisets are combined, adding the multiplicities.) 
From this definition it follows that 

    $a_i = \sum_{x \in S_i} 2^x$,    (36)

where the terms in this sum are not necessarily distinct. In particular, 

    $n = 2^{e_0} + 2^{e_1} + \cdots + 2^{e_t} = \sum_{x \in S_r} 2^x$.    (37)

The number of elements in the latter sum is at most $2^f$, where $f = r - d$ is the 
number of nondoublings. 
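
The rules for the multisets can be exercised mechanically. The following Python lines are only a minimal sketch under stated assumptions: the function name `star_chain_multisets` is invented for illustration, each multiset is represented by a `collections.Counter` of exponents, and the input is assumed to be a strictly increasing star chain. The sketch computes the sequence of doubling counts and the multisets, so that the relation between a chain element and the sum of powers of 2 over its multiset can be checked directly.

```python
from collections import Counter

def star_chain_multisets(chain):
    """Return (d, S) for a strictly increasing star chain.

    d[i] counts the doublings among steps 1..i; S[i] is a Counter holding
    the multiset S_i of exponents, built by rules (a), (b), (c).
    The function name and representation are illustrative assumptions.
    """
    d = [0]
    S = [Counter({0: 1})]                     # rule (a): S_0 = {0}
    for i in range(len(chain) - 1):
        a_i, a_next = chain[i], chain[i + 1]
        if a_next == 2 * a_i:                 # rule (b): add 1 to every exponent
            S.append(Counter({x + 1: m for x, m in S[i].items()}))
            d.append(d[i] + 1)
        else:                                 # rule (c): a_{i+1} = a_i + a_k, k < i
            k = chain.index(a_next - a_i)     # ValueError if the step is not a star step
            if k >= i:
                raise ValueError("not a star chain")
            S.append(S[i] + S[k])             # Counter addition adds multiplicities
            d.append(d[i])
    return d, S

if __name__ == "__main__":
    chain = [1, 2, 3, 5, 10, 20, 23]
    d, S = star_chain_multisets(chain)
    for a, di, Si in zip(chain, d, S):
        print(a, di, sorted(Si.elements()))
```

For the star chain 1, 2, 3, 5, 10, 20, 23 the final multiset has five elements, within the bound $2^f = 8$ for $f = 3$ nondoublings.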
Since n has two different binary representations in (37), we can partition 
the multiset $S_r$ into multisets $M_0$, $M_1$, ..., $M_t$ such that 

    $2^{e_j} = \sum_{x \in M_j} 2^x$,    $0 \le j \le t$.    (38)

This can be done by arranging the elements of $S_r$ into nondecreasing order 
$x_1 \le x_2 \le \cdots$ and taking $M_t = \{x_1, x_2, \ldots, x_k\}$, where $2^{x_1} + \cdots + 2^{x_k} = 2^{e_t}$. 
This must be possible, since $e_t$ is the smallest of the e’s. Similarly, $M_{t-1} = 
\{x_{k+1}, x_{k+2}, \ldots, x_{k'}\}$, and so on; the process is easily visualized in binary 
notation. An example appears below. 

Let $M_j$ contain $m_j$ elements (counting multiplicities); then $m_j \le 2^f - t$, 
since $S_r$ has at most $2^f$ elements and it has been partitioned into t+1 nonempty 
multisets. By Eq. (38), we can see that 

    $e_j \ge x > e_j - m_j$,    for all $x \in M_j$.    (39)

Our examination of the star chain’s structure is completed by forming the 
multisets $M_{ij}$ that record the ancestral history of $M_j$. The multiset $S_i$ is 
partitioned into t + 1 multisets as follows: 

a) $M_{rj} = M_j$; 

b) If $a_{i+1} = 2a_i$, then $M_{ij} = M_{(i+1)j} - 1 = \{x - 1 \mid x \in M_{(i+1)j}\}$; 

c) If $a_{i+1} = a_i + a_k$, $k < i$, then (since $S_{i+1} = S_i \uplus S_k$) we let $M_{ij} = M_{(i+1)j}$ 
minus $S_k$, that is, we remove the elements of $S_k$ from $M_{(i+1)j}$. If some 
element of $S_k$ appears in two or more different multisets $M_{(i+1)j}$, we remove 
it from the set with the largest possible value of j; this rule uniquely defines 
$M_{ij}$ for each j, when i is fixed. 

From this definition it follows that 

    $e_j + d_i - d \ge x > e_j + d_i - d - m_j$,    for all $x \in M_{ij}$.    (40)


As an example of this detailed construction, let us consider the star chain 
1, 2, 3, 5, 10, 20, 23, for which t = 3, r = 6, d = 3, f = 3. We obtain the 
following array of multisets: 

    (d_0, d_1, ..., d_6):        0     1     1     1      2      3      3
    (a_0, a_1, ..., a_6):        1     2     3     5      10     20     23
    (M_03, M_13, ..., M_63):     ∅     ∅     ∅     ∅      ∅      ∅      {0}      M_3:  e_3 = 0, m_3 = 1
    (M_02, M_12, ..., M_62):     ∅     ∅     ∅     ∅      ∅      ∅      {1}      M_2:  e_2 = 1, m_2 = 1
    (M_01, M_11, ..., M_61):     ∅     ∅     {0}   {0}    {1}    {2}    {2}      M_1:  e_1 = 2, m_1 = 1
    (M_00, M_10, ..., M_60):     {0}   {1}   {1}   {1,1}  {2,2}  {3,3}  {3,3}    M_0:  e_0 = 4, m_0 = 2
                                 S_0   S_1   S_2   S_3    S_4    S_5    S_6

(Each column lists the parts into which the multiset $S_i$ named at its foot is partitioned.) 
Thus $M_{40} = \{2,2\}$, etc. From the construction we can see that $d_i$ is the largest 
element of $S_i$; hence 

    $d_i \in M_{i0}$.    (41)
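
The array of the example can be recomputed from the definitions. The Python sketch below is an illustration only, not Hansen's or the author's program: the name `hansen_decomposition` and the list-of-lists layout `M[i][j]` are assumptions made here. It builds the multisets $S_i$ forward, partitions $S_r$ greedily from the smallest elements upward as described for (38), and then works backward using rules (a), (b), (c) for the $M_{ij}$, removing each element of $S_k$ from the part with the largest possible j.

```python
from collections import Counter

def hansen_decomposition(chain):
    """Return (e, M) for a star chain; M[i][j] is M_ij as a sorted list."""
    r = len(chain) - 1
    n = chain[r]
    e = [x for x in range(n.bit_length() - 1, -1, -1) if (n >> x) & 1]  # e_0 > ... > e_t
    t = len(e) - 1
    # Forward pass: the multisets S_i, remembering k for each nondoubling step.
    S, partner = [Counter({0: 1})], [None]
    for i in range(r):
        if chain[i + 1] == 2 * chain[i]:
            S.append(Counter({x + 1: c for x, c in S[i].items()}))
            partner.append(None)
        else:
            k = chain.index(chain[i + 1] - chain[i])   # ValueError if not a star step
            S.append(S[i] + S[k])
            partner.append(k)
    # Partition S_r: the smallest elements form M_t, the next ones M_{t-1}, ...
    M = [[None] * (t + 1) for _ in range(r + 1)]
    pool = iter(sorted(S[r].elements()))
    for j in range(t, -1, -1):
        block, total = Counter(), 0
        while total < 2 ** e[j]:
            x = next(pool)
            block[x] += 1
            total += 2 ** x
        assert total == 2 ** e[j]          # the text asserts this is always possible
        M[r][j] = block
    # Backward pass: rules (a), (b), (c) for the M_ij.
    for i in range(r - 1, -1, -1):
        k = partner[i + 1]
        if k is None:                      # step i+1 was a doubling
            M[i] = [Counter({x - 1: c for x, c in m.items()}) for m in M[i + 1]]
        else:                              # remove S_k, preferring the largest j
            cur = [Counter(m) for m in M[i + 1]]
            for x in S[k].elements():
                j = max(j for j in range(t + 1) if cur[j][x] > 0)
                cur[j][x] -= 1
            M[i] = [+m for m in cur]       # unary + drops zero counts
    return e, [[sorted(m.elements()) for m in row] for row in M]

if __name__ == "__main__":
    e, M = hansen_decomposition([1, 2, 3, 5, 10, 20, 23])
    for i, row in enumerate(M):
        print(i, row)
```

With the chain 1, 2, 3, 5, 10, 20, 23 the entry `M[4][0]` comes out as `[2, 2]`, matching the value of $M_{40}$ quoted in the text, and the largest element of each `M[i][0]` equals $d_i$.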
The most important part of this structure comes from Eq. (40); one of its 
immediate consequences is 


Lemma K. If $M_{ij}$ and $M_{uv}$ both contain a common integer x, then 

    $-m_v < (e_j - e_v) - (d_u - d_i) < m_j$. ∎    (42)


Although Lemma K may not look extremely powerful, it says (when $m_j$ 
and $m_v$ are reasonably small and when $M_{ij}$ contains an element in common 
with $M_{uv}$) that the number of doublings between steps u and i is approximately 
equal to the difference between the exponents $e_v$ and $e_j$. This imposes a certain 
amount of regularity on the addition chain; and it suggests that we might be 
able to prove a result analogous to Theorem B above, that $l^*(n) = e_0 + t$, if the 
exponents $e_j$ are far enough apart. The next theorem shows how this can in fact 
be done. 


Theorem H. [W. Hansen, Crelle 202 (1959), 129-136.] Let $n = 2^{e_0} + 2^{e_1} + 
\cdots + 2^{e_t}$, where $e_0 > e_1 > \cdots > e_t \ge 0$. If 

    $e_0 > 2e_1 + 2.271(t - 1)$  and  $e_{i-1} \ge e_i + 2m$  for $1 \le i \le t$,    (43)

where $m = 2^{\lfloor 3.271(t-1) \rfloor} - t$, then $l^*(n) = e_0 + t$. 


Proof. We may assume that t > 2, since the result of the theorem is true 
without restriction on the e’s when $t \le 2$. Suppose that we have a star chain 
$1 = a_0 < a_1 < \cdots < a_r = n$ for n with $r \le e_0 + t - 1$. Let the integers d, f, 
$d_0$, ..., $d_r$, and the multisets $M_{ij}$, $S_i$, $M_j$ reflect the structure of this chain, as 
defined above. By Corollary A, we know that $f \le \lfloor 3.271(t - 1) \rfloor$; therefore the 
value of m is a bona fide upper bound for the number $m_j$ of elements in each 
multiset $M_j$. 

In the summation 

    $a_i = \Bigl(\sum_{x \in M_{i0}} 2^x\Bigr) + \Bigl(\sum_{x \in M_{i1}} 2^x\Bigr) + \cdots + \Bigl(\sum_{x \in M_{it}} 2^x\Bigr)$,

no carries propagate from the term corresponding to $M_{ij}$ to the term corresponding 
to $M_{i(j-1)}$, if we think of this sum as being carried out in the binary number 
system, since the e’s are so far apart. (See (40).) In particular, the sum of all 
the terms for $j \ne 0$ will not carry up to affect the terms for j = 0, so we must 
have 

    $a_i \ge \sum_{x \in M_{i0}} 2^x \ge 2^{\lambda(a_i)}$,    $0 \le i \le r$.    (44)

In order to prove Theorem H, we would like to show that in some sense the 
t extra powers of n must be put in “one at a time,” so we want to find a way to 
tell at which step each of these terms essentially enters the addition chain. 

Let j be a number between 1 and t. Since $M_{0j}$ is empty and $M_{rj} = M_j$ is 
nonempty, we can find the first step i for which $M_{ij}$ is not empty. 

From the way in which the $M_{ij}$ are defined, we know that step i is a 
nondoubling: $a_i = a_{i-1} + a_u$ for some $u < i - 1$. We also know that all the elements of 
$M_{ij}$ are elements of $S_u$. We will prove that $a_u$ must be relatively small compared 
to $a_i$. 

Let $x_j$ be an element of $M_{ij}$. Then since $x_j \in S_u$, there is some v for which 
$x_j \in M_{uv}$. It follows that 

    $d_i - d_u > m$,    (45)

that is, at least m + 1 doublings occur between steps u and i. For if $d_i - d_u \le m$, 
Lemma K tells us that $|e_j - e_v| < 2m$; hence v = j. But this is impossible, 
because $M_{uj}$ is empty by our choice of step i. 

All elements of $S_u$ are less than or equal to $e_1 + d_i - d$. For if $x \in S_u \subseteq S_i$ 
and $x > e_1 + d_i - d$, then $x \in M_{u0}$ and $x \in M_{i0}$ by (40); so Lemma K implies 
that $|d_i - d_u| < m$, contradicting (45). In fact, this argument proves that $M_{i0}$ 
has no elements in common with $S_u$, so $M_{(i-1)0} = M_{i0}$. From (44) we have 
$a_{i-1} \ge 2^{\lambda(a_i)}$, and therefore step i is a small step. 

We can now deduce what is probably the key fact in this entire proof: All 
elements of $S_u$ are in $M_{u0}$. For if not, let x be an element of $S_u$ with $x \notin M_{u0}$. 
Since $x \ge 0$, (40) implies that $e_1 \ge d - d_u$, hence 

    $e_0 = f + d - s \le 2.271s + d \le 2.271(t - 1) + e_1 + d_u$.

By hypothesis (43), this implies $d_u > e_1$. But $d_u \in S_u$ by (41), and it cannot be 
in $M_{i0}$, hence $d_u \le e_1 + d_i - d \le e_1$, a contradiction. 

Going back to our element $x_j$ in $M_{ij}$, we have $x_j \in M_{uv}$; and we have proved 
that v = 0. Therefore, by equation (40) again, 

    $e_0 + d_u - d \ge x_j > e_0 + d_u - d - m_0$.    (46)

For all j = 1, 2, ..., t we have determined a number $x_j$ satisfying (46), 
and a small step i at which the term $2^{e_j}$ may be said to have entered into the 
addition chain. If $j \ne j'$, the step i at which this occurs cannot be the same for 
both j and j'; for (46) would tell us that $|x_j - x_{j'}| < m$, while elements of $M_{ij}$ 
and $M_{ij'}$ must differ by more than m, since $e_j$ and $e_{j'}$ are so far apart. We are 
forced to conclude that the chain contains at least t small steps; but this is a 
contradiction. ∎ 


Theorem F (W. Hansen). 

    $l(2^A + xy) \le A + \nu(x) + \nu(y) - 1$,    if $\lambda(x) + \lambda(y) \le A$.    (47)


Proof. An addition chain (which is not a star chain in general) may be 
constructed by combining the binary and factor methods. Let $x = 2^{x_1} + \cdots + 2^{x_u}$ 
and $y = 2^{y_1} + \cdots + 2^{y_v}$, where $x_1 > \cdots > x_u \ge 0$ and $y_1 > \cdots > y_v \ge 0$. 

The first steps of the chain form successive powers of 2, until $2^{A-y_1}$ is 
reached; in between these steps, the additional values $2^{x_{u-1}} + 2^{x_u}$, $2^{x_{u-2}} + 
2^{x_{u-1}} + 2^{x_u}$, ..., and x are inserted in the appropriate places. After a chain up 
to $2^{A-y_i} + x(2^{y_1-y_i} + \cdots + 2^{y_{i-1}-y_i})$ has been formed, we continue by adding x 
and doubling the resulting sum $y_i - y_{i+1}$ times; this yields 

    $2^{A-y_{i+1}} + x(2^{y_1-y_{i+1}} + \cdots + 2^{y_i-y_{i+1}})$.

If this construction is done for i = 1, 2, ..., v, assuming for convenience that 
$y_{v+1} = 0$, we have an addition chain for $2^A + xy$ as desired. ∎ 
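
The construction in this proof is easy to carry out by machine. The following Python sketch follows the steps of the proof as read here; the names `theorem_f_chain`, `exponents`, and `is_addition_chain` are invented for illustration, and the quadratic-time validity check is meant only for small inputs.

```python
def exponents(n):
    """Exponents of the 1 bits of n, largest first."""
    return [e for e in range(n.bit_length() - 1, -1, -1) if (n >> e) & 1]

def theorem_f_chain(A, x, y):
    """Addition chain for 2**A + x*y with A + nu(x) + nu(y) - 1 steps."""
    xs, ys = exponents(x), exponents(y)        # x_1 > ... > x_u and y_1 > ... > y_v
    if xs[0] + ys[0] > A:
        raise ValueError("need lambda(x) + lambda(y) <= A")
    top = A - ys[0]
    first = {1 << e for e in range(top + 1)}   # 1, 2, 4, ..., 2^(A - y_1)
    partial = 0
    for e in reversed(xs):                     # 2^{x_u}, 2^{x_{u-1}} + 2^{x_u}, ..., x
        partial += 1 << e
        first.add(partial)
    chain = sorted(first)
    cur = 1 << top
    for i, y_i in enumerate(ys):
        cur += x                               # add x
        chain.append(cur)
        y_next = ys[i + 1] if i + 1 < len(ys) else 0   # convention y_{v+1} = 0
        for _ in range(y_i - y_next):          # double y_i - y_{i+1} times
            cur *= 2
            chain.append(cur)
    return chain

def is_addition_chain(chain):
    """True if chain starts with 1 and each later element is a sum of two earlier ones."""
    if not chain or chain[0] != 1:
        return False
    seen = {1}
    for i in range(1, len(chain)):
        if not any(chain[i] - b in seen for b in chain[:i]):
            return False
        seen.add(chain[i])
    return True

if __name__ == "__main__":
    print(theorem_f_chain(5, 3, 3))
```

With A = 5 and x = y = 3 the sketch produces a chain of length 8 = 5 + 2 + 2 - 1 ending in 41 = 2^5 + 9.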


Theorem F enables us to find values of n for which $l(n) < l^*(n)$, since 
Theorem H gives an explicit value of $l^*(n)$ in certain cases. For example, let 
$x = 2^{1016} + 1$, $y = 2^{2032} + 1$, and let 

    $n = 2^{6103} + xy = 2^{6103} + 2^{3048} + 2^{2032} + 2^{1016} + 1$.

According to Theorem F, we have $l(n) \le 6106$. But Theorem H also applies, 
with m = 508, and this proves that $l^*(n) = 6107$. 

Extensive computer calculations have shown that n = 12509 is the smallest 
value with $l(n) < l^*(n)$. No star chain for this value of n is as short as the 
sequence 1, 2, 4, 8, 16, 17, 32, 64, 128, 256, 512, 1024, 1041, 2082, 4164, 8328, 
8345, 12509. The smallest n with $\nu(n) = 5$ and $l(n) \ne l^*(n)$ is $16537 = 2^{14} + 9 \cdot 17$ 
(see exercise 15). 
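
A short check confirms that the sequence just quoted for 12509 is an addition chain of length 17 and that it violates the star condition. This is a sketch only; the helper names `is_addition_chain` and `is_star_chain` are assumptions introduced here, and the check says nothing about whether some other star chain of length 17 exists.

```python
def is_addition_chain(chain):
    """Each element after the leading 1 is the sum of two earlier elements."""
    return chain[0] == 1 and all(
        any(chain[i] - b in chain[:i] for b in chain[:i])
        for i in range(1, len(chain))
    )

def is_star_chain(chain):
    """Star condition: every step uses the immediately preceding element."""
    return chain[0] == 1 and all(
        chain[i] - chain[i - 1] in chain[:i] for i in range(1, len(chain))
    )

CHAIN_12509 = [1, 2, 4, 8, 16, 17, 32, 64, 128, 256, 512, 1024,
               1041, 2082, 4164, 8328, 8345, 12509]

if __name__ == "__main__":
    print(len(CHAIN_12509) - 1,
          is_addition_chain(CHAIN_12509),
          is_star_chain(CHAIN_12509))
```

The star condition already fails at the element 32, which is 16 + 16 rather than 17 plus an earlier element.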

Jan van Leeuwen has generalized Theorem H to show that 

    $l^*(k2^{e_0}) + t \le l^*(kn) \le l^*(k2^{e_t}) + e_0 - e_t + t$

for all fixed $k \ge 1$, if the exponents $e_0 > \cdots > e_t$ are far enough apart [Crelle 
295 (1977), 202-207]. 


Some conjectures. Although it was reasonable to guess at first glance that 
$l(n) = l^*(n)$, we have now seen that this is false. Another plausible conjecture 
[first made by A. Goulard, and supposedly “proved” by E. de Jonquières in 
L’Interméd. des Math. 2 (1895), 125-126] is that $l(2n) = l(n) + 1$; a doubling step 
is so efficient, it seems unlikely that there could be any shorter chain for 2n than 
to add a doubling step to the shortest chain for n. But computer calculations 
show that this conjecture also fails, since $l(191) = l(382) = 11$. (A star chain of 
length 11 for 382 is not hard to find; for example, 1, 2, 4, 5, 9, 14, 23, 46, 92, 184, 
198, 382. The number 191 is minimal such that $l(n) = 11$, and it seems to be 
nontrivial to prove by hand that $l(191) > 10$. The author’s computer-generated 
proof of this fact, using a backtrack method that will be sketched in Section 
7.2.2, involved a detailed examination of 102 cases.) The smallest four values 
of n such that $l(2n) = l(n)$ are n = 191, 701, 743, 1111; E. G. Thurber proved 
in Pacific J. Math. 49 (1973), 229-242, that the third of these is a member of an 
infinite family of such n, namely $23 \cdot 2^k + 7$ for all $k \ge 5$. Neill Clift found in 2007 
that $l(n) = l(2n) = l(4n) = 31$ when n = 30958077; and in 2008, astonishingly, 
he discovered that $l(n) > l(2n) = 34$ when n = 375494703. Kevin R. Hebb has 
shown that $l(n) - l(mn)$ can get arbitrarily large, for all fixed integers m not a 
power of 2 [Notices Amer. Math. Soc. 21 (1974), A-294]. The smallest case in 
which $l(n) > l(mn)$ is $l((2^{13} + 1)/3) = 15$. 

Let c(r) be the smallest value of n such that $l(n) = r$. The computation 
of $l(n)$ seems to be hardest for this sequence of n’s, which begins as follows: 

     r   c(r)        r   c(r)          r   c(r)
     1   2          14   1087         27   2211837
     2   3          15   1903         28   4169527
     3   5          16   3583         29   7624319
     4   7          17   6271         30   14143037
     5   11         18   11231        31   25450463
     6   19         19   18287        32   46444543
     7   29         20   34303        33   89209343
     8   47         21   65131        34   155691199
     9   71         22   110591       35   298695487
    10   127        23   196591       36   550040063
    11   191        24   357887       37   994660991
    12   379        25   685951       38   1886023151
    13   607        26   1176431      39   3502562143

For $r \le 11$, the value of c(r) is approximately equal to c(r - 1) + c(r - 2), and 
this fact led to speculation by several people that c(r) grows like the function $\phi^r$; 
but the result of Theorem D (with n = c(r)) implies that $r/\lg c(r) \to 1$ as 
$r \to \infty$. The values listed here for r > 18 have been computed by Achim 
Flammenkamp, except that c(24) was first computed by Daniel Bleichenbacher, 
and c(29) through c(39) by Neill Clift. Flammenkamp notes that c(r) is fairly 
well approximated by the formula $2^r \exp(-\theta r/\lg r)$ for $10 \le r \le 39$, where $\theta$ 
is near ln 2; this agrees nicely with the upper bound (25). Several people had 
conjectured at one time that c(r) would always be a prime number, in view of 
the factor method; but c(15), c(18), and c(21) are all divisible by 11. Perhaps 
no conjecture about addition chains is safe! 

Tabulated values of $l(n)$ show that this function is surprisingly smooth; for 
example, $l(n) = 13$ for all n in the range $1125 \le n \le 1148$. The computer 
calculations show that a table of $l(n)$ may be prepared for $2 \le n \le 1000$ by 
using the formula 

    $l(n) = \min(l(n - 1) + 1,\ l_n) - \delta_n$,    (48)

where $l_n = \infty$ if n is prime, otherwise $l_n = l(p) + l(n/p)$ if p is the smallest prime 
dividing n; and $\delta_n = 1$ for n in Table 1, $\delta_n = 0$ otherwise. 
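
Formula (48) can be spot-checked for small n by brute force. The sketch below is not the backtrack method of Section 7.2.2; it is a plain iterative-deepening search over strictly increasing addition chains, written here under the assumption that restricting to increasing chains loses no generality. The names `shortest_chain_length`, `formula_48`, and `TABLE_1_UP_TO_64` are illustrative, and only the Table 1 entries not exceeding 64 are used.

```python
def shortest_chain_length(n):
    """l(n), found by iterative deepening over increasing addition chains."""
    if n == 1:
        return 0

    def search(chain, limit):
        last = chain[-1]
        if last == n:
            return True
        steps_left = limit - (len(chain) - 1)
        # Prune: even doubling at every remaining step cannot reach n.
        if steps_left == 0 or (last << steps_left) < n:
            return False
        tried = set()
        for i in range(len(chain) - 1, -1, -1):
            for j in range(i, -1, -1):
                s = chain[i] + chain[j]
                if last < s <= n and s not in tried:
                    tried.add(s)
                    chain.append(s)
                    found = search(chain, limit)
                    chain.pop()
                    if found:
                        return True
        return False

    limit = n.bit_length() - 1      # lambda(n) is a lower bound for l(n)
    while not search([1], limit):
        limit += 1
    return limit

TABLE_1_UP_TO_64 = {23, 43, 59}     # the entries of Table 1 that are <= 64

def formula_48(n, l):
    """Right-hand side of (48), given a dict l of already known values."""
    p = next((q for q in range(2, int(n ** 0.5) + 1) if n % q == 0), None)
    via_factor = float("inf") if p is None else l[p] + l[n // p]
    delta = 1 if n in TABLE_1_UP_TO_64 else 0
    return min(l[n - 1] + 1, via_factor) - delta

if __name__ == "__main__":
    l = {n: shortest_chain_length(n) for n in range(1, 65)}
    print(all(formula_48(n, l) == l[n] for n in range(2, 65)))
```

The same dictionary of values also yields the first few entries of the c(r) table, since c(r) is the least n with l(n) = r.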

Let d(r) be the number of solutions n to the equation $l(n) = r$. The following 
table lists the first few values of d(r), according to Flammenkamp and Clift: 

    r  d(r)    r  d(r)     r  d(r)     r  d(r)      r  d(r)        r  d(r)
    1  1       6  15      11  246     16  4490     21  90371      26  1896704
    2  2       7  26      12  432     17  8170     22  165432     27  3501029
    3  3       8  44      13  772     18  14866    23  303475     28  6465774
    4  5       9  78      14  1382    19  27128    24  558275     29  11947258
    5  9      10  136     15  2481    20  49544    25  1028508    30  22087489

Surely d(r) must be an increasing function of r, but there is no evident way to 
prove this seemingly simple assertion, much less to determine the asymptotic 
growth of d(r) for large r. 
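
For very small r the values of d(r) can be reproduced by exhaustive enumeration. The sketch below is an assumption-laden illustration, far too slow for the larger entries of the table: it enumerates every strictly increasing addition chain with at most six steps and records the shortest length at which each number appears. The names `minimal_lengths`, `d`, and `c` are chosen here for convenience.

```python
from collections import Counter

def minimal_lengths(max_len):
    """Map n -> l(n) for every n with l(n) <= max_len.

    All strictly increasing addition chains with at most max_len steps are
    enumerated; best[n] is the smallest number of steps at which n appears.
    """
    best = {1: 0}

    def extend(chain):
        steps = len(chain) - 1
        if steps == max_len:
            return
        last = chain[-1]
        for s in sorted({a + b for a in chain for b in chain if a + b > last}):
            if steps + 1 < best.get(s, max_len + 1):
                best[s] = steps + 1
            chain.append(s)
            extend(chain)
            chain.pop()

    extend([1])
    return best

lengths = minimal_lengths(6)
d = Counter(lengths.values())       # d[r] = number of n with l(n) = r
c = {r: min(n for n, l in lengths.items() if l == r) for r in range(1, 7)}

if __name__ == "__main__":
    print([d[r] for r in range(1, 7)])
    print([c[r] for r in range(1, 7)])
```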


The most famous problem about addition chains that is still outstanding is 
the Scholz-Brauer conjecture, which states that 

    $l(2^n - 1) \le n - 1 + l(n)$.    (49)

Notice that $2^n - 1$ is the worst case for the binary method, because $\nu(2^n - 1) = n$. 
E. G. Thurber [Discrete Math. 16 (1976), 279-289] has shown that several of 
these values, including the case n = 32, can actually be calculated by hand. 
Computer calculations by Neill Clift [Computing 91 (2011), 265-284] show that 
$l(2^n - 1)$ is in fact exactly equal to $n - 1 + l(n)$ for $1 \le n \le 64$. Arnold Scholz 
coined the name “addition chain” (in German) and posed (49) as a problem in 
1937 [Jahresbericht der Deutschen Mathematiker-Vereinigung, Abteilung II, 47 
(1937), 41-42]; Alfred Brauer proved in 1939 that 

    $l^*(2^n - 1) \le n - 1 + l^*(n)$.    (50)




Table 1 
VALUES OF n FOR SPECIAL ADDITION CHAINS 


23 163 229 319 371 413 453 553 599 645 707 741 813 849 903 
43 165 233 323 373 419 455 557 611 659 709 749 825 863 905 
59 179 281 347 377 421 457 561 619 667 711 759 835 869 923 
77 203 283 349 381 423 479 569 623 669 713 779 837 887 941 
83 211 293 355 382 429 503 571 631 677 715 787 839 893 947 
107 213 311 359 395 437 509 573 637 683 717 803 841 899 955 
149 227 317 367 403 451 551 581 643 691 739 809 845 901 983 


Hansen’s theorems show that $l(n)$ can be less than $l^*(n)$, so more work is 
definitely necessary in order to prove or disprove (49). As a step in this direction, 
Hansen has defined the concept of an $l^0$-chain, which lies “between” l-chains 
and $l^*$-chains. In an $l^0$-chain, some of the elements are underlined; the condition 
is that $a_i = a_j + a_k$, where $a_j$ is the largest underlined element less than $a_i$. 

As an example of an $l^0$-chain (certainly not a minimum one), consider 

    $\underline{1}, \underline{2}, \underline{4}, 5, \underline{8}, 10, 12, \underline{18}$;    (51)

it is easy to verify that the difference between each element and the previous 
underlined element is in the chain. We let $l^0(n)$ denote the minimum length of 
an $l^0$-chain for n. Clearly $l(n) \le l^0(n) \le l^*(n)$. 

Hansen pointed out that the chain constructed in Theorem F is an $l^0$-chain 
(see exercise 22); and he also established the following improvement of Eq. (50): 


Theorem G. $l^0(2^n - 1) \le n - 1 + l^0(n)$. 


Proof. Let $1 = a_0, a_1, \ldots, a_r = n$ be an $l^0$-chain of minimum length for n, and 
let $1 = b_0, b_1, \ldots, b_t = n$ be the subsequence of underlined elements. (We may 
assume that n is underlined.) Then we can get an $l^0$-chain for $2^n - 1$ as follows: 

a) Include the $l^0(n) + 1$ numbers $2^{a_i} - 1$, for $0 \le i \le r$, underlined if and only 
if $a_i$ is underlined. 
b) Include the numbers $2^i(2^{b_j} - 1)$, for $0 \le j < t$ and for $0 < i \le b_{j+1} - b_j$, all 
underlined. (This is a total of $b_1 - b_0 + \cdots + b_t - b_{t-1} = n - 1$ numbers.) 
c) Sort the numbers from (a) and (b) into ascending order. 

We may easily verify that this gives an $l^0$-chain: The numbers of (b) are all 
equal to twice some other element of (a) or (b); furthermore, this element is the 
preceding underlined element. If $a_i = b_j + a_k$, where $b_j$ is the largest underlined 
element less than $a_i$, then $a_k = a_i - b_j \le b_{j+1} - b_j$, so $2^{a_k}(2^{b_j} - 1) = 2^{a_i} - 2^{a_k}$ 
appears underlined in the chain, just preceding $2^{a_i} - 1$. Since $2^{a_i} - 1$ is equal to 
$(2^{a_i} - 2^{a_k}) + (2^{a_k} - 1)$, where both of these values appear in the chain, we have 
an addition chain with the $l^0$ property. ∎ 


The chain corresponding to (51), constructed in the proof of Theorem G, is 


1,2, 3, 6, 12, 15, 30, 31, 60, 120, 240, 255, 510, 1020, 1023, 2040, 
4080, 4095, 8160, 16320, 32640, 65280, 130560, 261120, 262143. 
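
The construction of Theorem G can be replayed in a few lines. This Python sketch is an illustration under stated assumptions: the names `theorem_g_chain` and `is_l0_chain` are invented, and the underlined elements of (51) are taken to be 1, 2, 4, 8, 18, which is the choice that reproduces the chain listed above.

```python
def theorem_g_chain(chain, underlined):
    """Build the chain for 2**n - 1 from an l^0-chain for n.

    chain      -- increasing l^0-chain 1 = a_0, ..., a_r = n
    underlined -- set of underlined elements (assumed to contain 1 and n)
    Returns (values, marked): the sorted chain and its underlined elements.
    """
    b = sorted(underlined)
    # Part (a): 2^{a_i} - 1, underlined exactly when a_i is underlined.
    marks = {2 ** a - 1: (a in underlined) for a in chain}
    # Part (b): 2^i (2^{b_j} - 1) for 0 < i <= b_{j+1} - b_j, all underlined.
    for b_j, b_next in zip(b, b[1:]):
        for i in range(1, b_next - b_j + 1):
            marks[2 ** i * (2 ** b_j - 1)] = True
    # Part (c): sort into ascending order.
    values = sorted(marks)
    return values, {v for v in values if marks[v]}

def is_l0_chain(chain, underlined):
    """Each a_i must equal (largest underlined element below a_i) + (an earlier element)."""
    if chain[0] != 1 or 1 not in underlined:
        return False
    for i in range(1, len(chain)):
        a_j = max(a for a in chain[:i] if a in underlined)
        if chain[i] - a_j not in chain[:i]:
            return False
    return True

if __name__ == "__main__":
    values, marked = theorem_g_chain([1, 2, 4, 5, 8, 10, 12, 18], {1, 2, 4, 8, 18})
    print(values)
    print(is_l0_chain(values, marked))
```

The resulting chain has 24 steps, namely n - 1 = 17 from part (b) plus the 7 steps of the chain (51).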


Computations by Neill Clift have shown that $l(n) < l^0(n)$ when n = 5784689 
(see exercise 42). This is the smallest case where Eq. (49) remains in doubt. 

Graphical representation. An addition chain (1) corresponds in a natural 
way to a directed graph, where the vertices are labeled $a_i$ for $0 \le i \le r$, and 
where we draw arcs from $a_j$ to $a_i$ and from $a_k$ to $a_i$ as a representation of each 
step $a_i = a_j + a_k$ in (2). For example, the addition chain 1, 2, 3, 6, 12, 15, 27, 
39, 78, 79 that appears in Fig. 15 corresponds to the directed graph 

    [Figure: directed graph on the vertices 1, 2, 3, 6, 12, 15, 27, 39, 78, 79]

If $a_i = a_j + a_k$ for more than one pair of indices (j,k), we choose a definite j 
and k for purposes of this construction. 

In general, all but the first vertex of such a directed graph will be at the 
head of exactly two arcs; however, this is not really an important property of 
the graph, because it conceals the fact that many different addition chains can 
be essentially equivalent. If a vertex has out-degree 1, it is used in only one later 
step, hence the later step is essentially a sum of three inputs $a_j + a_k + a_m$ that 
might be computed either as $(a_j + a_k) + a_m$ or as $a_j + (a_k + a_m)$ or as $a_k + (a_j + a_m)$. 
These three choices are immaterial, but the addition-chain conventions force us 
to distinguish between them. We can avoid such redundancy by deleting any 
vertex whose out-degree is 1 and attaching the arcs from its predecessors to its 
successor. For example, the graph above would become 

    [Figure: the reduced directed graph]    (52)

We can also delete any vertex whose out-degree is 0, except of course the final 
vertex $a_r$, since such a vertex corresponds to a useless step in the addition chain. 

In this way every addition chain leads to a reduced directed graph that 
contains one “source” vertex (labeled 1) and one “sink” vertex (labeled n); 
every vertex but the source has in-degree $\ge 2$ and every vertex but the sink 
has out-degree $\ge 2$. Conversely, any such directed graph without oriented cycles 
corresponds to at least one addition chain, since we can topologically sort the 
vertices and write down d - 1 addition steps for each vertex of in-degree d > 0. 
The length of the addition chain, exclusive of useless steps, can be reconstructed 
by looking at the reduced graph; it is 

    (number of arcs) - (number of vertices) + 1,    (53)

since deletion of a vertex of out-degree 1 also deletes one arc. 

We say that two addition chains are equivalent if they have the same reduced 
directed graph. For example, the addition chain 1, 2, 3, 6, 12, 15, 24, 39, 40, 79 
is equivalent to the chain we began with, since it also leads to (52). This example 
shows that a non-star chain can be equivalent to a star chain. An addition chain 
is equivalent to a star chain if and only if its reduced directed graph can be 
topologically sorted in only one way. 

An important property of this graph representation has been pointed out 
by N. Pippenger: The label of each vertex is exactly equal to the number of 
oriented paths from the source to that vertex. Thus, the problem of finding an 
optimal addition chain for n is equivalent to minimizing the quantity (53) over 
all directed graphs that have one source vertex and one sink vertex and exactly 
n oriented paths from the source to the sink. 

This characterization has a surprising corollary, because of the symmetry of 
the directed graph. If we reverse the directions of all the arcs, the source and 
the sink exchange roles, and we obtain another directed graph corresponding to 
a set of addition chains for the same n; these addition chains have the same 
length (53) as the chain we started with. For example, if we make the arrows 
in (52) run from right to left, and if we relabel the vertices according to the 
number of paths from the right-hand vertex, we get 

    [Figure: the reversed and relabeled directed graph]

One of the star chains corresponding to this reduced directed graph is 

    1, 2, 4, 6, 12, 24, 26, 52, 78, 79; 

we may call this a dual of the original addition chain. 

Exercises 39 and 40 discuss important consequences of this graphical 
representation and the duality principle. 
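
The reduction, the path-counting property, and the duality can all be tried on the example chain. The Python sketch below is one possible reading of the construction, with several assumptions: the names `reduced_graph` and `path_counts` are invented, arcs are stored in a `collections.Counter` keyed by (tail, head) with multiplicities, and the "definite" pair chosen for each step is the one with the largest possible $a_j$.

```python
from collections import Counter

def reduced_graph(chain):
    """Reduced directed graph of an increasing addition chain, as a Counter of arcs."""
    arcs = Counter()
    for i in range(1, len(chain)):
        a = chain[i]
        # One definite choice of (j, k): the largest a_j with a - a_j also earlier.
        a_j = max(b for b in chain[:i] if a - b in chain[:i])
        arcs[(a_j, a)] += 1
        arcs[(a - a_j, a)] += 1
    source, sink = chain[0], chain[-1]
    vertices = set(chain)
    changed = True
    while changed:
        changed = False
        for v in sorted(vertices - {source, sink}):
            out = [(w, c) for (u, w), c in arcs.items() if u == v]
            outdeg = sum(c for _, c in out)
            if outdeg > 1:
                continue
            # Out-degree 1: splice v out; out-degree 0: drop the useless step.
            incoming = [(u, c) for (u, w), c in arcs.items() if w == v]
            for u, c in incoming:
                del arcs[(u, v)]
                if outdeg == 1:
                    arcs[(u, out[0][0])] += c
            if outdeg == 1:
                del arcs[(v, out[0][0])]
            vertices.remove(v)
            changed = True
    return arcs

def path_counts(arcs, start, forward=True):
    """Number of oriented paths from start to every vertex (arcs reversed if not forward)."""
    edges = arcs if forward else Counter({(v, u): c for (u, v), c in arcs.items()})
    nodes = {x for arc in edges for x in arc}
    count = {v: 0 for v in nodes}
    count[start] = 1
    # Labels increase along every original arc, so sorting gives a topological order.
    for v in sorted(nodes, reverse=not forward):
        for (u, w), c in edges.items():
            if u == v:
                count[w] += c * count[v]
    return count

if __name__ == "__main__":
    g = reduced_graph([1, 2, 3, 6, 12, 15, 27, 39, 78, 79])
    vertices = {x for arc in g for x in arc}
    print(sum(g.values()) - len(vertices) + 1)        # the quantity (53)
    print(path_counts(g, 1))                          # labels as path counts
    print(sorted(path_counts(g, 79, forward=False).values()))
```

Reversing the arcs and counting paths from 79 gives the labels 1, 2, 6, 12, 26, 79, which are exactly the elements of the dual chain that survive reduction.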


EXERCISES 
1. [15] What is the value of Z when Algorithm A terminates? 


2. [24] Write a MIX program for Algorithm A, to calculate $x^n \bmod w$ given integers 
n and x, where w is the word size. Assume that MIX has the binary operations SRB, 
JAE, etc., that are described in Section 4.5.2. Write another program that computes 
$x^n \bmod w$ in a serial manner (multiplying repeatedly by x), and compare the running 
times of these programs. 


3. [22] How is x°” calculated by (a) the binary method? (b) the ternary method? 
(c) the quaternary method? (d) the factor method? 


4. [M20] Find a number n for which the octal ($2^3$-ary) method gives ten fewer 
multiplications than the binary method. 


5. [24] Figure 14 shows the first eight levels of the “power tree.” The (k+1)st level 
of this tree is defined as follows, assuming that the first k levels have been constructed: 
Take each node n of the kth level, from left to right in turn, and attach below it the 
nodes 

    $n + 1,\ n + a_1,\ n + a_2,\ \ldots,\ n + a_{k-1} = 2n$

(in this order), where $1, a_1, a_2, \ldots, a_{k-1}$ is the path from the root of the tree to n; 
but discard any node that duplicates a number that has already appeared in the tree. 

Design an efficient algorithm that constructs the first r+1 levels of the power tree. 
[Hint: Make use of two sets of variables LINKU[j], LINKR[j] for $0 \le j \le 2^r$; these point 
upwards and to the right, respectively, if j is a number in the tree.] 

6. [M26] If a slight change is made to the definition of the power tree that is given 
in exercise 5, so that the nodes below n are attached in decreasing order 

    $n + a_{k-1},\ \ldots,\ n + a_2,\ n + a_1,\ n + 1$

instead of increasing order, we get a tree whose first five levels are 

    [Figure: the first five levels of the modified tree]

Show that this tree gives a method of computing $x^n$ that requires exactly as many 
multiplications as the binary method; therefore it is not as good as the power tree, 
although it has been constructed in almost the same way. 

7. [M21] Prove that there are infinitely many values of n 

a) for which the factor method is better than the binary method; 

b) for which the binary method is better than the factor method; 

c) for which the power tree method is better than both the binary and factor methods. 
(Here the “better” method is the one that computes $x^n$ using fewer multiplications.) 

8. [M21] Prove that the power tree (exercise 5) never gives more multiplications for 
the computation of $x^n$ than the binary method. 

▶ 9. [25] Design an exponentiation procedure that is analogous to Algorithm A, but 
based on radix $m = 2^e$. Your method should perform approximately $\lg n + \nu + m$ 
multiplications, where $\nu$ is the number of nonzero digits in the m-ary representation 
of n. 

10. [10] Figure 15 shows a tree that indicates one way to compute $x^n$ with the fewest 
possible multiplications, for all $n \le 100$. How can this tree be conveniently represented 
within a computer, in just 100 memory locations? 

▶ 11. [M26] The tree of Fig. 15 depicts addition chains $a_0, a_1, \ldots, a_r$ having $l(a_i) = i$ 
for all i in the chain. Find all addition chains for n that have this property, when 
n = 43 and when n = 77. Show that any tree such as Fig. 15 must include either the 
path 1, 2, 4, 8, 9, 17, 34, 43, 77 or the path 1, 2, 4, 8, 9, 17, 34, 68, 77. 

12. [M10] Is it possible to extend the tree shown in Fig. 15 to an infinite tree that 
yields a minimum-multiplication rule for computing $x^n$, for all positive integers n? 

13. [M21] Find a star chain of length A + 2 for each of the four cases listed in 
Theorem C. (Consequently Theorem C holds also with $l$ replaced by $l^*$.) 

14. [M29] Complete the proof of Theorem C, by demonstrating that (a) step r - 1 is 
not a small step; and (b) $\lambda(a_{r-k})$ cannot be less than m - 1, where $m = \lambda(a_{r-1})$. 


15. [M48] Write a computer program to extend Theorem C, characterizing all n such 
that $l(n) = \lambda(n) + 3$ and characterizing all n such that $l^*(n) = \lambda(n) + 3$. 

16. [HM15] Show that Theorem D is not trivially true just because of the binary 
method; if $l^B(n)$ denotes the length of the addition chain for n produced by the binary 
S-and-X method, the ratio $l^B(n)/\lambda(n)$ does not approach a limit as $n \to \infty$. 

17. [M25] Explain how to find the intervals $J_1$, ..., $J_h$ that are required in the proof 
of Lemma P. 


18. [HM24] Let @ be a positive constant. Show that there is a constant a < 2 such 


that n 
z 


for all large m, where the sum is over all s, t, v satisfying (30). 


19. [M23] A “multiset” is like a set, but it may contain identical elements repeated 
a finite number of times. If A and B are multisets, we define new multisets $A \uplus B$, 
$A \cup B$, and $A \cap B$ in the following way: An element occurring exactly a times in A and 
b times in B occurs exactly a + b times in $A \uplus B$, exactly max(a, b) times in $A \cup B$, 
and exactly min(a, b) times in $A \cap B$. (A “set” is a multiset that contains no elements 
more than once; if A and B are sets, so are $A \cup B$ and $A \cap B$, and the definitions given 
in this exercise agree with the customary definitions of set union and intersection.) 

a) The prime factorization of a positive integer n is a multiset N whose elements are 
primes, where $\prod_{p \in N} p = n$. The fact that every positive integer can be uniquely 
factored into primes gives us a one-to-one correspondence between the positive 
integers and the finite multisets of prime numbers; for example, if $n = 2^2 \cdot 3^3 \cdot 17$, 
the corresponding multiset is N = {2, 2, 3, 3, 3, 17}. If M and N are the multisets 
corresponding respectively to m and n, what multisets correspond to gcd(m, n), 
lcm(m, n), and mn? 

b) Every monic polynomial f(z) over the complex numbers corresponds in a natural 
way to the multiset F of its “roots”; we have $f(z) = \prod_{\zeta \in F}(z - \zeta)$. If f(z) 
and g(z) are the polynomials corresponding to the finite multisets F and G of 
complex numbers, what polynomials correspond to $F \uplus G$, $F \cup G$, and $F \cap G$? 

c) Find as many interesting identities as you can that hold between multisets, with 
respect to the three operations $\uplus$, $\cup$, $\cap$. 


20. [M20] What are the sequences $S_i$ and $M_{ij}$ ($0 \le i \le r$, $0 \le j \le t$) arising in 
Hansen’s structural decomposition of star chains (a) of Type 3? (b) of Type 5? (The 
six “types” are defined in the proof of Theorem B.) 

21. [M26] (W. Hansen.) Let q be any positive integer. Find a value of n such that 
$l(n) \le l^*(n) - q$. 

22. [M20] Prove that the addition chain constructed in the proof of Theorem F is an 
$l^0$-chain. 

23. [M20] Prove Brauer’s inequality (50). 

24. [M22] Generalize the proof of Theorem G to show that $l^0((B^n - 1)/(B - 1)) \le 
(n - 1)\,l^0(B) + l^0(n)$, for any integer B > 1; and prove that $l(2^{mn} - 1) \le l(2^m - 1) + 
mn - m + l^0(n)$. 


25. [20] Let y be a fraction, 0 < y < 1, expressed in the binary number system as $y = 
(.d_1 \ldots d_k)_2$. Design an algorithm to compute $x^y$ using the operations of multiplication 
and square-root extraction. 

26. [M25] Design an efficient algorithm that computes the nth Fibonacci number $F_n$, 
modulo m, given large integers n and m. 


27. [M23] (A. Flammenkamp.) What is the smallest n for which every addition chain 
contains at least eight small steps? 

28. [HM33] (A. Schönhage.) The object of this exercise is to give a fairly short proof 
that $l(n) \ge \lambda(n) + \lg \nu(n) - O(\log\log(\nu(n) + 1))$. 

a) When $x = (x_k \ldots x_0 . x_{-1} \ldots)_2$ and $y = (y_k \ldots y_0 . y_{-1} \ldots)_2$ are real numbers written 
in binary notation, let us write $x \subseteq y$ if $x_j \le y_j$ for all j. Give a simple rule for 
constructing the smallest number z with the property that $x' \subseteq x$ and $y' \subseteq y$ implies 
$x' + y' \subseteq z$. Denoting this number by $x \nabla y$, prove that $\nu(x \nabla y) \le \nu(x) + \nu(y)$. 

b) Given any addition chain (11) with $r = l(n)$, let the sequence $d_0, d_1, \ldots, d_r$ 
be defined as in (35), and define the sequence $A_0, A_1, \ldots, A_r$ by the following 
rules: $A_0 = 1$; if $a_i = 2a_{i-1}$ then $A_i = 2A_{i-1}$; otherwise if $a_i = a_j + a_k$ for some 
$0 \le k \le j < i$, then $A_i = A_{i-1} \nabla (A_{i-1}/2^{d_j - d_k})$. Prove that this sequence “covers” 
the given chain, in the sense that $a_i \subseteq A_i$ for $0 \le i \le r$. 

c) Let $\delta$ be a positive integer (to be selected later). Call the nondoubling step $a_i = 
a_j + a_k$ a “baby step” if $d_j - d_k \ge \delta$, otherwise call it a “close step.” Let $B_0 = 1$; 
$B_i = 2B_{i-1}$ if $a_i = 2a_{i-1}$; $B_i = B_{i-1} \nabla (B_{i-1}/2^{d_j - d_k})$ if $a_i = a_j + a_k$ is a baby 
step; and $B_i = \rho(2B_{i-1})$ otherwise, where $\rho(x)$ is the least number y such that 
$x/2^e \subseteq y$ for $0 \le e \le \delta$. Show that $A_i \subseteq B_i$ and $\nu(B_i) \le (1 + \delta c_i)2^{b_i}$ for $0 \le i \le r$, 
where $b_i$ and $c_i$ respectively denote the number of baby steps and close steps $\le i$. 
[Hint: Show that the 1s in $B_i$ appear in consecutive blocks of size $\ge 1 + \delta c_i$.] 

d) We now have $l(n) = r = b_r + c_r + d_r$ and $\nu(n) \le \nu(B_r) \le (1 + \delta c_r)2^{b_r}$. Explain 
how to choose $\delta$ in order to obtain the inequality stated at the beginning of this 
exercise. [Hint: See (16), and note that $n \le 2^r \alpha^{b_r}$ for some $\alpha < 1$ depending 
on $\delta$.] 


29. [M49] (K. B. Stolarsky, 1969.) Is $\nu(n) \le 2^{l(n) - \lambda(n)}$ for all positive integers n? (If 
so, we have the lower bound $l(2^n - 1) \ge n - 1 + \lceil \lg n \rceil$; see (17) and (49).) 

30. [20] An addition-subtraction chain has the rule $a_i = a_j \pm a_k$ in place of (2); 
the imaginary computer described in the text has a new operation code, SUB. (This 
corresponds in practice to evaluating $x^n$ using both multiplications and divisions.) Find 
an addition-subtraction chain, for some n, that has fewer than $l(n)$ steps. 

31. [M46] (D. H. Lehmer.) Explore the problem of minimizing $\epsilon q + (r - q)$ in an 
addition chain (1), where q is the number of “simple” steps in which $a_i = a_{i-1} + 1$, 
given a small positive “weight” $\epsilon$. (This problem comes closer to reality for many 
calculations of $x^n$, if multiplication by x is simpler than a general multiplication; see 
the applications in Section 4.6.2.) 

32. [M30] (A. C. Yao, F. F. Yao, R. L. Graham.) Associate the “cost” $a_j a_k$ with each 
step $a_i = a_j + a_k$ of an addition chain (1). Show that the left-to-right binary method 
yields a chain of minimum total cost, for all positive integers n. 


33. [15] How many addition chains of length 9 have (52) as their reduced directed 
graph? 


34. [M23] The binary addition chain for n = 2°° +---+ 2°, when eo >--- > e¢ > 0, 
is L, 2; ocg 207TA oe Dy ae ee ee 20e Det? te ay Me. This 
corresponds to the S-and-X method described at the beginning of this section, while 
Algorithm A corresponds to the addition chain obtained by sorting the two sequences 
(1,2,4,...,2°°) and (2°-1+42°, 2°-2 + 2°-14 9%... n) into increasing order. Prove 
or disprove: Each of these addition chains is a dual of the other. 


35. [M27] How many addition chains without useless steps are equivalent to each of 
the addition chains discussed in exercise 34, when eo > e1 + 1? 
> 36. [25] (E. G. Straus.) Find a way to compute a general monomial $x_1^{n_1} x_2^{n_2} \ldots x_m^{n_m}$ 
in at most $2\lambda(\max(n_1, n_2, \ldots, n_m)) + 2^m - m - 1$ multiplications. 

37. [HM30] (A. C. Yao.) Let $l(n_1, \ldots, n_m)$ be the length of the shortest addition 
chain that contains m given numbers $n_1 < \cdots < n_m$. Prove that 
$l(n_1, \ldots, n_m) \le \lambda(n_m) + m\lambda(n_m)/\lambda\lambda(n_m) + O(\lambda(n_m)\lambda\lambda\lambda(n_m)/\lambda\lambda(n_m)^2)$, 
thereby generalizing (25). 

38. [M47] What is the asymptotic value of $l(1, 4, 9, \ldots, m^2) - m$, as $m \to \infty$, in the 
notation of exercise 37? 

> 39. [M25] (J. Olivos, 1979.) Let $l([n_1, n_2, \ldots, n_m])$ be the minimum number of 
multiplications needed to evaluate the monomial $x_1^{n_1} x_2^{n_2} \ldots x_m^{n_m}$ in the sense of exercise 36, 
where each $n_i$ is a positive integer. Prove that this problem is equivalent to the problem 
of exercise 37, by showing that $l([n_1, n_2, \ldots, n_m]) = l(n_1, n_2, \ldots, n_m) + m - 1$. [Hint: 
Consider directed graphs like (52) that have more than one source vertex.] 


> 40. [M21] (J. Olivos.) Generalizing the factor method and Theorem F, prove that 

$$l(m_1 n_1 + \cdots + m_t n_t) \le l(m_1, \ldots, m_t) + l(n_1, \ldots, n_t) + t - 1,$$

where $l(n_1, \ldots, n_t)$ is defined in exercise 37. 

41. [M40] (P. Downey, B. Leong, R. Sethi.) Let G be a connected graph with n 
vertices {1,...,n} and m edges, where the edges join $u_j$ to $v_j$ for $1 \le j \le m$. Prove 
that $l(1, 2, \ldots, 2^{An}, 2^{Au_1}+2^{Av_1}+1, \ldots, 2^{Au_m}+2^{Av_m}+1) = An + m + k$ for all sufficiently 
large A, where k is the minimum number of vertices in a vertex cover for G (namely a 
set that contains either $u_j$ or $v_j$ for $1 \le j \le m$). 

42. [22] (Neill Clift, 2005.) Show that neither 1, 2, 4, 8, 16, 32, 64, 65, 97, 128, 
256, 353, 706, 1412, 2824, 5648, 11296, 22592, 45184, 90368, 180736, 361472, 361537, 
723074, 1446148, 2892296, 5784592, 5784689 nor its dual is an $l^0$-chain. 


43. [M50] Is $l(2^n - 1) \le n - 1 + l(n)$ for all integers n > 0? Does equality always hold? 


4.6.4. Evaluation of Polynomials 


Now that we know efficient ways to evaluate the special polynomial $x^n$, let us 
consider the general problem of computing an nth degree polynomial 


$$u(x) = u_n x^n + u_{n-1} x^{n-1} + \cdots + u_1 x + u_0, \qquad u_n \ne 0, \tag{1}$$


for given values of x. This problem arises frequently in practice. 

In the following discussion we shall concentrate on minimizing the number of 
operations required to evaluate polynomials by computer, blithely assuming that 
all arithmetic operations are exact. Polynomials are most commonly evaluated 
using floating point arithmetic, which is not exact, and different schemes for 
the evaluation will, in general, give different answers. A numerical analysis of 
the accuracy achieved depends on the coefficients of the particular polynomial 
being considered, and is beyond the scope of this book; the reader should be 
careful to investigate the accuracy of any calculations undertaken with floating 
point arithmetic. In most cases the methods we shall describe turn out to be 
reasonably satisfactory from a numerical standpoint, but many bad examples 
can also be given. [See Webb Miller, SICOMP 4 (1975), 97-107, for a survey of 
the literature on stability of fast polynomial evaluation, and for a demonstration 
that certain kinds of numerical stability cannot be guaranteed for some families 
of high-speed algorithms.] 

Throughout this section we will act as if the variable x were a single number. 
But it is important to keep in mind that most of the methods we will discuss 
are valid also when the variables are large objects like multiprecision numbers, 
polynomials, or matrices. In such cases efficient formulas lead to even bigger 
payoffs, especially when we can reduce the number of multiplications. 

A beginning programmer will often evaluate the polynomial (1) in a manner 
that corresponds directly to its conventional textbook form: First $u_n x^n$ is 
calculated, then $u_{n-1} x^{n-1}$, ..., $u_1 x$, and finally all of the terms of (1) are added 
together. But even if the efficient methods of Section 4.6.3 are used to evaluate 
the powers of x in this approach, the resulting calculation is needlessly slow 
unless nearly all of the coefficients $u_k$ are zero. If the coefficients are all nonzero, 
an obvious alternative would be to evaluate (1) from right to left, computing the 
values of $x^k$ and $u_k x^k + \cdots + u_0$ for k = 1, ..., n. Such a process involves 2n − 1 
multiplications and n additions, and it might also require further instructions to 
store and retrieve intermediate results from memory. 
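
The right-to-left process just described can be sketched in a few lines of Python. This is only an illustration: the function name and the convention that `u[k]` holds the coefficient of x^k are assumptions made here, and the counters exist solely to confirm the 2n − 1 multiplications and n additions quoted above.

```python
def eval_right_to_left(u, x):
    """Evaluate u[0] + u[1]*x + ... + u[n]*x**n from right to left.

    Returns (value, multiplications, additions).
    u[k] is assumed to be the coefficient of x**k (a convention chosen here).
    """
    n = len(u) - 1
    total = u[0]
    power = x            # x**1 costs nothing
    mults = adds = 0
    for k in range(1, n + 1):
        if k > 1:
            power = power * x      # one multiplication to form x**k
            mults += 1
        total = total + u[k] * power   # one multiplication, one addition
        mults += 1
        adds += 1
    return total, mults, adds


# 1 + 2x + 3x^2 + 4x^3 at x = 2
print(eval_right_to_left([1, 2, 3, 4], 2))   # (49, 5, 3)
```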


Horner’s rule. One of the first things a novice programmer is usually taught 
is an elegant way to rearrange this computation, by evaluating u(x) as follows: 

$$u(x) = (\ldots(u_n x + u_{n-1})x + \cdots)x + u_0. \tag{2}$$

Start with $u_n$, multiply by x, add $u_{n-1}$, multiply by x, ..., multiply by x, 
add $u_0$. This form of the computation is usually called “Horner’s rule”; we have 
already seen it used in connection with radix conversion in Section 4.4. The 
entire process requires n multiplications and n additions, minus one addition 
for each coefficient that is zero. Furthermore, there is no need to store partial 
results, since each quantity arising during the calculation is used immediately 
after it has been computed. 
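
As a minimal sketch of rule (2), assuming the coefficients are supplied as a list `u` with `u[k]` the coefficient of x^k (a convention chosen for this illustration, not taken from the text):

```python
def horner(u, x):
    """Evaluate u[0] + u[1]*x + ... + u[n]*x**n by Horner's rule (2)."""
    result = u[-1]                       # start with u_n
    for k in range(len(u) - 2, -1, -1):
        result = result * x + u[k]       # multiply by x, add u_k
    return result


print(horner([1, 2, 3, 4], 2))    # 49
print(horner([-1, 2, 3], -2))     # 7
```

Only one running value is kept, which reflects the remark that no partial results need to be stored.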

W. G. Horner gave this rule early in the nineteenth century [Philosophical 
Transactions, Royal Society of London 109 (1819), 308-335] in connection with 
a procedure for calculating polynomial roots. The fame of the latter method [see 
J. L. Coolidge, Mathematics of Great Amateurs (Oxford, 1949), Chapter 15] 
accounts for the fact that Horner’s name has been attached to (2); but actually 
Isaac Newton had made use of the same idea more than 150 years earlier. For 
example, in a well-known work entitled De Analysi per Æquationes Infinitas, 
originally written in 1669, Newton wrote 


$$\overline{\overline{\overline{y-4}\times y:+5}\times y:-12}\times y:+17$$

for the polynomial $y^4 - 4y^3 + 5y^2 - 12y + 17$, while illustrating what later came 
to be known as Newton’s method for rootfinding. This clearly shows the idea 
of (2), since he often denoted grouping by using horizontal lines and colons 
instead of parentheses. Newton had been using the idea for several years in 
unpublished notes. [See The Mathematical Papers of Isaac Newton, edited by 
D. T. Whiteside, 1 (1967), 490, 531; 2 (1968), 222.] Independently, a method 
equivalent to Horner’s had in fact been used in 13th-century China by Ch’in 
Chiu-Shao [see Y. Mikami, The Development of Mathematics in China and Japan 
(1913), 73-77]. 

Several generalizations of Horner’s rule have been suggested. Let us first 
consider evaluating u(z) when z is a complex number, while the coefficients $u_k$ 
are real. In particular, when $z = e^{i\theta} = \cos\theta + i\sin\theta$, the polynomial u(z) is 
essentially two Fourier series, 


$$(u_0 + u_1\cos\theta + \cdots + u_n\cos n\theta) + i(u_1\sin\theta + \cdots + u_n\sin n\theta).$$


Complex addition and multiplication can obviously be reduced to a sequence of 
ordinary operations on real numbers: 


real + complex        requires 1 addition 
complex + complex     requires 2 additions 
real × complex        requires 2 multiplications 
complex × complex     requires 4 multiplications, 2 additions 
                      or 3 multiplications, 5 additions 


(See exercise 41. Subtraction is considered here as if it were equivalent to 
addition.) Therefore Horner’s rule (2) uses either 4n − 2 multiplications and 
3n − 2 additions or 3n − 1 multiplications and 6n − 5 additions to evaluate u(z) 
when z = x + iy is complex. Actually 2n − 4 of these additions can be saved, since 
we are multiplying by the same number z each time. An alternative procedure 
for evaluating u(x + iy) is to let 


$$a_1 = u_n, \quad b_1 = u_{n-1}, \quad r = x + x, \quad s = x^2 + y^2;$$
$$a_j = b_{j-1} + r a_{j-1}, \quad b_j = u_{n-j} - s a_{j-1}, \qquad 1 < j \le n. \tag{3}$$


Then it is easy to prove by induction that $u(z) = za_n + b_n$. This scheme [BIT 5 
(1965), 142; see also G. Goertzel, AMM 65 (1958), 34-35] requires only 2n + 2 
multiplications and 2n + 1 additions, so it is an improvement over Horner’s rule 
when n ≥ 3. In the case of Fourier series, when $z = e^{i\theta}$, we have s = 1, so the 
number of multiplications drops to n + 1. The moral of this story is that a good 
programmer does not make indiscriminate use of the built-in complex-arithmetic 
features of high-level programming languages. 
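
Scheme (3) can be sketched as follows, using real arithmetic only. The function name, the list convention (`u[k]` is the coefficient of z^k) and the returned (real, imaginary) pair are assumptions of this illustration; the recurrence itself is the one in (3), and it presumes n ≥ 1.

```python
def eval_real_poly_at_complex(u, x, y):
    """Return (Re u(z), Im u(z)) for z = x + iy, real coefficients u[0..n], n >= 1."""
    n = len(u) - 1
    r = x + x                  # r = 2x
    s = x * x + y * y          # s = |z|^2
    a, b = u[n], u[n - 1]      # a_1, b_1
    for j in range(2, n + 1):
        # a_j = b_{j-1} + r*a_{j-1},  b_j = u_{n-j} - s*a_{j-1}
        a, b = b + r * a, u[n - j] - s * a
    # u(z) = z*a_n + b_n
    return x * a + b, y * a


# 1 + 2z + 3z^2 at z = 1 + 2i
print(eval_real_poly_at_complex([1, 2, 3], 1, 2))   # (-6, 16)
```

Each pass through the loop uses two real multiplications and two real additions, in line with the totals stated for this scheme.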

Consider the process of dividing the polynomial u(x) by $x - x_0$, using 
Algorithm 4.6.1D to obtain $u(x) = (x - x_0)q(x) + r(x)$; here deg(r) < 1, so r(x) 
is a constant independent of x, and $u(x_0) = 0 \cdot q(x_0) + r = r$. An examination 
of this division process reveals that the computation is essentially the same as 
Horner’s rule for evaluating $u(x_0)$. Similarly, if we divide u(z) by the polynomial 
$(z - z_0)(z - \bar z_0) = z^2 - 2x_0 z + x_0^2 + y_0^2$, the resulting computation turns out to 
be equivalent to (3); we obtain $u(z) = (z - z_0)(z - \bar z_0)q(z) + a_n z + b_n$, hence 
$u(z_0) = a_n z_0 + b_n$. 

In general, if we divide u(x) by f(x) to obtain u(x) = f(x)q(x) + r(x), 
and if $f(x_0) = 0$, we have $u(x_0) = r(x_0)$; this observation leads to further 
generalizations of Horner’s rule. For example, we may let $f(x) = x^2 - x_0^2$; this 
yields the “second-order” Horner’s rule 


$$\begin{aligned} u(x) = {} & (\ldots(u_{2\lfloor n/2\rfloor}x^2 + u_{2\lfloor n/2\rfloor-2})x^2 + \cdots)x^2 + u_0 \\ & {} + ((\ldots(u_{2\lceil n/2\rceil-1}x^2 + u_{2\lceil n/2\rceil-3})x^2 + \cdots)x^2 + u_1)x. \end{aligned} \tag{4}$$

The second-order rule uses n + 1 multiplications and n additions (see exercise 5); 
so it is no improvement over Horner’s rule from this standpoint. But there are at 
least two circumstances in which (4) is useful: If we want to evaluate both u(x) 
and u(−x), this approach yields u(−x) with just one more addition operation; 
two values can be obtained almost as cheaply as one. Moreover, if we have a 
computer that allows parallel computations, the two lines of (4) may be evaluated 
independently, so we save about half the running time. 
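
The first of these uses of (4) can be sketched as below. The function name and the coefficient-list convention are assumptions of this illustration, and for brevity each Horner loop starts from zero, which spends one multiplication more per line than (4) strictly needs.

```python
def eval_plus_minus(u, x):
    """Return (u(x), u(-x)) by splitting u into even and odd parts, as in (4)."""
    x2 = x * x
    even = 0
    for c in reversed(u[0::2]):     # u_0, u_2, u_4, ... by Horner in x^2
        even = even * x2 + c
    odd = 0
    for c in reversed(u[1::2]):     # u_1, u_3, u_5, ... by Horner in x^2
        odd = odd * x2 + c
    odd_part = odd * x
    # one extra subtraction yields the value at -x
    return even + odd_part, even - odd_part


print(eval_plus_minus([1, 2, 3, 4], 2))   # (49, -23)
```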

When our computer allows parallel computation on k arithmetic units at 
once, a “kth-order” Horner’s rule (obtained in a similar manner from 
$f(x) = x^k - x_0^k$) may be used. Another attractive method for parallel computation has 
been suggested by G. Estrin [Proc. Western Joint Computing Conf. 17 (1960), 
33-40]; for n = 7, Estrin’s method is: 


Processor 1              Processor 2          Processor 3              Processor 4          Processor 5 
$a_1 = u_7 x + u_6$      $b_1 = u_5 x + u_4$  $c_1 = u_3 x + u_2$      $d_1 = u_1 x + u_0$  $x^2$ 
$a_2 = a_1 x^2 + b_1$                         $c_2 = c_1 x^2 + d_1$                         $x^4$ 
$a_3 = a_2 x^4 + c_2$ 


Here a3 = u(x). However, an interesting analysis by W. S. Dorn [IBM J. Res. 
and Devel. 6 (1962), 239-245] shows that these methods might not actually be 
an improvement over the second-order rule, if each arithmetic unit must access 
a memory that communicates with only one processor at a time. 
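
The pairing pattern of Estrin's method extends to any degree. The following sequential sketch (the function name and the zero-padding of odd-length rows are choices made here, not part of the cited paper) performs the same combinations that the processors would carry out simultaneously, one "round" per loop iteration.

```python
def estrin(u, x):
    """Evaluate sum(u[k] * x**k) with Estrin's pairing scheme; u must be nonempty."""
    coeffs = list(u)
    power = x                        # x, then x^2, x^4, ...
    while len(coeffs) > 1:
        if len(coeffs) % 2:          # pad so that terms pair up
            coeffs.append(0)
        # every pair could be handled by a separate processor
        coeffs = [coeffs[i] + coeffs[i + 1] * power
                  for i in range(0, len(coeffs), 2)]
        power = power * power
    return coeffs[0]


print(estrin([1, 2, 3, 4], 2))   # 49
```

For n = 7 the three rounds produce exactly the quantities in the table: first $d_1, c_1, b_1, a_1$, then $c_2, a_2$, then $a_3$.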


Tabulating polynomial values. If we wish to evaluate an nth degree polynomial 
at many points in an arithmetic progression (that is, if we want to calculate 
$u(x_0)$, $u(x_0 + h)$, $u(x_0 + 2h)$, ...), the process can be reduced to addition 
only, after the first few steps. For if we start with any sequence of numbers 
$(\alpha_0, \alpha_1, \ldots, \alpha_n)$ and apply the transformation 


$$\alpha_0 \leftarrow \alpha_0 + \alpha_1, \quad \alpha_1 \leftarrow \alpha_1 + \alpha_2, \quad \ldots, \quad \alpha_{n-1} \leftarrow \alpha_{n-1} + \alpha_n, \tag{5}$$


we find that k applications of (5) yield 


$$\alpha_j^{(k)} = \binom{k}{0}\beta_j + \binom{k}{1}\beta_{j+1} + \binom{k}{2}\beta_{j+2} + \cdots, \qquad 0 \le j \le n,$$

where $\beta_j$ denotes the initial value of $\alpha_j$ and $\beta_j = 0$ for j > n. In particular, 

$$\alpha_0^{(k)} = \binom{k}{0}\beta_0 + \binom{k}{1}\beta_1 + \cdots + \binom{k}{n}\beta_n \tag{6}$$


is a polynomial of degree n in k. By properly choosing the β’s, as shown 
in exercise 7, we can set things up so that this quantity $\alpha_0^{(k)}$ is the desired 
value $u(x_0 + kh)$, for all k. In other words, each execution of the n additions in 
(5) will produce the next value of the given polynomial. 

Caution: Rounding errors can accumulate after many repetitions of (5), and 
an error in $\alpha_j$ produces a corresponding error in the coefficients of $x^0$, ..., $x^j$ 
in the polynomial being computed. Therefore the values of the α’s should be 
“refreshed” after a large number of iterations. 
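
A small sketch of the tabulation process follows. The text leaves the choice of the β’s to exercise 7; the code below assumes one standard choice, namely the forward differences $\beta_j = \Delta^j u(x_0)$ with step h, which makes (6) coincide with Newton's forward-difference formula. The function name and argument order are likewise assumptions.

```python
def tabulate(u, x0, h, count):
    """Return [u(x0), u(x0+h), ..., u(x0+(count-1)h)] using additions only
    after the set-up phase. u[k] is the coefficient of x**k."""
    n = len(u) - 1

    def horner(x):
        r = 0
        for c in reversed(u):
            r = r * x + c
        return r

    # Set-up: n+1 direct evaluations, then forward differences in place,
    # so that alpha[j] = (j-th forward difference of u at x0) = beta_j.
    alpha = [horner(x0 + k * h) for k in range(n + 1)]
    for level in range(1, n + 1):
        for j in range(n, level - 1, -1):
            alpha[j] = alpha[j] - alpha[j - 1]

    values = []
    for _ in range(count):
        values.append(alpha[0])
        for j in range(n):               # transformation (5): n additions
            alpha[j] = alpha[j] + alpha[j + 1]
    return values


# 1 + 2x + 3x^2 at x = 0, 1, 2, 3, 4
print(tabulate([1, 2, 3], 0, 1, 5))   # [1, 6, 17, 34, 57]
```

With exact integer arithmetic the table stays correct indefinitely; with floating point the caution above about refreshing the α’s applies.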

Derivatives and changes of variable. Sometimes we want to find the coefficients 
of $u(x + x_0)$, given a constant $x_0$ and the coefficients of u(x). For example, 
if $u(x) = 3x^2 + 2x - 1$, then $u(x - 2) = 3x^2 - 10x + 7$. This is analogous to a radix 
conversion problem, converting from base x to base x + 2. By Taylor’s theorem, 
the new coefficients are given by the derivatives of u(x) at $x = x_0$, namely 


$$u(x + x_0) = u(x_0) + u'(x_0)x + (u''(x_0)/2!)x^2 + \cdots + (u^{(n)}(x_0)/n!)x^n, \tag{7}$$

so the problem is equivalent to evaluating u(x) and all its derivatives. 

If we write $u(x) = q(x)(x - x_0) + r$, then $u(x + x_0) = q(x + x_0)x + r$; so 
r is the constant coefficient of $u(x + x_0)$, and the problem reduces to finding the 
coefficients of $q(x + x_0)$, where q(x) is a known polynomial of degree n − 1. Thus 
the following algorithm is indicated: 


H1. Set $v_j \leftarrow u_j$ for $0 \le j \le n$. 
H2. For k = 0, 1, ..., n − 1 (in this order), set $v_j \leftarrow v_j + x_0 v_{j+1}$ for j = n − 1, 
..., k + 1, k (in this order). ∎ 


At the conclusion of step H2 we have $u(x + x_0) = v_n x^n + \cdots + v_1 x + v_0$. This 
procedure was a principal part of Horner’s root-finding method, and when k = 0 
it is exactly rule (2) for evaluating $u(x_0)$. 
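
Steps H1 and H2 translate almost line for line into code. In this sketch the function name `taylor_shift` and the list convention (`u[k]` is the coefficient of x^k) are assumptions made for illustration.

```python
def taylor_shift(u, x0):
    """Return the coefficients v[0..n] of u(x + x0), given those of u(x)."""
    v = list(u)                              # H1
    n = len(v) - 1
    for k in range(n):                       # H2: k = 0, 1, ..., n-1
        for j in range(n - 1, k - 1, -1):    # j = n-1, ..., k+1, k
            v[j] = v[j] + x0 * v[j + 1]
    return v


# u(x) = 3x^2 + 2x - 1, so u(x - 2) = 3x^2 - 10x + 7
print(taylor_shift([-1, 2, 3], -2))   # [7, -10, 3]
```

After the pass with k = 0, `v[0]` already holds $u(x_0)$, which is the sense in which the first pass is rule (2).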

Horner’s method requires $(n^2+n)/2$ multiplications and $(n^2+n)/2$ additions; 
but notice that if $x_0 = 1$ we avoid all of the multiplications. Fortunately we can 
reduce the general problem to the case $x_0 = 1$ by introducing comparatively few 
multiplications and divisions: 


S1. Compute and store the values $x_0^2$, ..., $x_0^n$. 
S2. Set $v_j \leftarrow u_j x_0^j$ for $0 \le j \le n$. (Now $v(x) = u(x_0 x)$.) 
S3. Perform step H2 but with $x_0 = 1$. (Now $v(x) = u(x_0(x+1)) = u(x_0 x + x_0)$.) 
S4. Set $v_j \leftarrow v_j/x_0^j$ for $0 < j \le n$. (Now $v(x) = u(x + x_0)$ as desired.) ∎ 


This idea, due to M. Shaw and J. F. Traub [JACM 21 (1974), 161-167], has the 
same number of additions and the same numerical stability as Horner’s method; 
but it needs only 2n − 1 multiplications and n − 1 divisions, since $v_n = u_n$. About 
$\frac{1}{2}n$ of these multiplications can, in turn, be avoided (see exercise 6). 
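
Steps S1 through S4 can be sketched as follows, assuming $x_0 \ne 0$. The function name is hypothetical, and the demonstration uses Python's `fractions.Fraction` only so that the divisions in S4 are exact; ordinary floating point numbers work the same way.

```python
from fractions import Fraction


def shaw_traub_shift(u, x0):
    """Coefficients of u(x + x0) via steps S1-S4; x0 must be nonzero."""
    n = len(u) - 1
    powers = [1, x0]                          # S1: x0^2, ..., x0^n
    for _ in range(2, n + 1):
        powers.append(powers[-1] * x0)
    v = [u[j] * powers[j] for j in range(n + 1)]   # S2: v(x) = u(x0*x)
    for k in range(n):                        # S3: step H2 with x0 = 1,
        for j in range(n - 1, k - 1, -1):     #     additions only
            v[j] = v[j] + v[j + 1]
    for j in range(1, n):                     # S4: n-1 divisions
        v[j] = v[j] / powers[j]
    v[n] = u[n]                               # v_n = u_n needs no division
    return v


u = [Fraction(-1), Fraction(2), Fraction(3)]
print(shaw_traub_shift(u, Fraction(-2)) == [7, -10, 3])   # True
```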

If we want only the first few or the last few derivatives, Shaw and Traub 
have observed that there are further ways to save time. For example, if we just 
want to evaluate u(x) and u′(x), we can do the job with 2n − 1 additions and 
about $n + \sqrt{2n}$ multiplications/divisions as follows: 

D1. Compute and store the values $x^2$, $x^3$, ..., $x^t$, $x^{2t}$, where $t = \lceil\sqrt{n/2}\,\rceil$. 

D2. Set $v_j \leftarrow u_j x^{f(j)}$ for $0 \le j \le n$, where $f(j) = t - 1 - ((n - 1 - j) \bmod 2t)$ 
for $0 \le j < n$, and $f(n) = t$. 

D3. Set $v_j \leftarrow v_j + v_{j+1}x^{g(j)}$ for j = n − 1, ..., 1, 0; here $g(j) = 2t$ when n − 1 − j 
is a positive multiple of 2t, otherwise $g(j) = 0$ and the multiplication by 
$x^{g(j)}$ need not be done. 

D4. Set $v_j \leftarrow v_j + v_{j+1}x^{g(j)}$ for j = n − 1, ..., 2, 1. Now $v_0/x^{f(0)} = u(x)$ and 
$v_1/x^{f(1)} = u'(x)$. ∎ 
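
Steps D1 through D4 are easier to follow in executable form. The sketch below assumes n ≥ 1 and x ≠ 0, and for brevity it recomputes powers with `x ** e` instead of looking them up in the table of step D1, so it illustrates the data flow rather than the operation count. The function name is an assumption.

```python
import math


def value_and_derivative(u, x):
    """Return (u(x), u'(x)) following steps D1-D4; requires n >= 1 and x != 0."""
    n = len(u) - 1
    t = math.ceil(math.sqrt(n / 2))

    def f(j):
        return t if j == n else t - 1 - ((n - 1 - j) % (2 * t))

    def g(j):
        d = n - 1 - j
        return 2 * t if d > 0 and d % (2 * t) == 0 else 0

    v = [u[j] * x ** f(j) for j in range(n + 1)]     # D2 (f(j) may be negative)
    for j in range(n - 1, -1, -1):                   # D3
        v[j] = v[j] + (v[j + 1] * x ** g(j) if g(j) else v[j + 1])
    for j in range(n - 1, 0, -1):                    # D4
        v[j] = v[j] + (v[j + 1] * x ** g(j) if g(j) else v[j + 1])
    return v[0] / x ** f(0), v[1] / x ** f(1)


# u(x) = 1 + 2x + 3x^2: u(2) = 17, u'(2) = 14
print(value_and_derivative([1, 2, 3], 2.0))
```

The scaling by $x^{f(j)}$ in D2 is what lets most of the steps in D3 and D4 be pure additions; a multiplication by $x^{2t}$ is needed only once every 2t steps.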

Adaptation of coefficients. Let us now return to our original problem of 
evaluating a given polynomial u(x) as rapidly as possible, for “random” values 
of x. The importance of this problem is due partly to the fact that standard 
functions such as sin x, cos x, $e^x$, etc., are usually computed by subroutines that 
rely on the evaluation of certain polynomials; such polynomials are evaluated so 
often, it is desirable to find the fastest possible way to do the computation. 

Arbitrary polynomials of degree five and higher can be evaluated with fewer 
operations than Horner’s rule requires, if we first “adapt” or “precondition” 
the coefficients $u_0$, $u_1$, ..., $u_n$. This adaptation process might involve a lot of 
work, as explained below; but the preliminary calculation is not wasted, since 
it must be done only once while the polynomial will be evaluated many times. 
For examples of “adapted” polynomials for standard functions, see V. Y. Pan, 
USSR Computational Math. and Math. Physics 2 (1963), 137-146. 

The simplest case for which adaptation of coefficients is helpful occurs for a 
fourth degree polynomial: 


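$$u(x) = u_4 x^4 + u_3 x^3 + u_2 x^2 + u_1 x + u_0, \qquad u_4 \ne 0. \tag{8}$$

This equation can be rewritten in a form originally suggested by T. S. Motzkin, 

$$y = (x + \alpha_0)x + \alpha_1, \qquad u(x) = ((y + x + \alpha_2)y + \alpha_3)\alpha_4, \tag{9}$$


for suitably “adapted” coefficients $\alpha_0$, $\alpha_1$, $\alpha_2$, $\alpha_3$, $\alpha_4$. The computation in this 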
scheme involves three multiplications, five additions, and (on a one-accumulator 
machine like MIX) one instruction to store the partial result y into temporary 
storage. By comparison with Horner’s rule, we have traded a multiplication for 
an addition and a possible storage command. Even this comparatively small 
change is worthwhile if the polynomial is to be evaluated often. (Of course, if 
the time for multiplication is comparable to the time for addition, (9) gives no 
improvement; we will see that a general fourth-degree polynomial always requires 
at least eight arithmetic operations for its evaluation.) 

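By equating coefficients in (8) and (9), we obtain formulas for computing 
the $\alpha_j$’s in terms of the $u_k$’s: 


$$\alpha_0 = \tfrac{1}{2}(u_3/u_4 - 1), \quad \beta = u_2/u_4 - \alpha_0(\alpha_0 + 1), \quad \alpha_1 = u_1/u_4 - \alpha_0\beta,$$
$$\alpha_2 = \beta - 2\alpha_1, \quad \alpha_3 = u_0/u_4 - \alpha_1(\alpha_1 + \alpha_2), \quad \alpha_4 = u_4. \tag{10}$$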


A similar scheme, which evaluates a fourth-degree polynomial in the same num- 
ber of steps as (9), appears in exercise 18; this alternative method will give 
greater numerical accuracy than (9) in certain cases, although it yields poorer 
accuracy in others. 
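
Formulas (10) and scheme (9) fit in a few lines. In this sketch the two function names are hypothetical, and `fractions.Fraction` is used in the demonstration only so that the comparison with the direct value is exact.

```python
from fractions import Fraction


def adapt_quartic(u):
    """Adapted coefficients (alpha_0, ..., alpha_4) of (10) for u = [u0, u1, u2, u3, u4]."""
    u0, u1, u2, u3, u4 = u
    a0 = (u3 / u4 - 1) / 2
    beta = u2 / u4 - a0 * (a0 + 1)
    a1 = u1 / u4 - a0 * beta
    a2 = beta - 2 * a1
    a3 = u0 / u4 - a1 * (a1 + a2)
    return a0, a1, a2, a3, u4


def eval_adapted_quartic(alpha, x):
    """Scheme (9): three multiplications and five additions."""
    a0, a1, a2, a3, a4 = alpha
    y = (x + a0) * x + a1
    return ((y + x + a2) * y + a3) * a4


u = [Fraction(c) for c in (5, -4, 3, 2, 7)]      # 7x^4 + 2x^3 + 3x^2 - 4x + 5
alpha = adapt_quartic(u)
print(eval_adapted_quartic(alpha, 2) == 7 * 16 + 2 * 8 + 3 * 4 - 4 * 2 + 5)   # True
```

The adaptation is done once; only `eval_adapted_quartic` runs for each new x.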

Polynomials that arise in practice often have a rather small leading coeffi- 
cient, so that the division by u4 in (10) leads to instability. In such a case it 
is usually preferable to replace x by $|u_4|^{1/4}x$ as the first step, reducing (8) to a 
polynomial whose leading coefficient is ±1. A similar transformation applies to 
polynomials of higher degrees. This idea is due to C. T. Fike [CACM 10 (1967), 
175-178], who has presented several interesting examples. 




Any polynomial of the fifth degree may be evaluated using four multiplications, 
six additions, and one storing, by using the rule $u(x) = U(x)x + u_0$, where 
$U(x) = u_5 x^4 + u_4 x^3 + u_3 x^2 + u_2 x + u_1$ is evaluated as in (9). Alternatively, we can 
do the evaluation with four multiplications, five additions, and three storings, 
if the calculations take the form 


$$y = (x + \alpha_0)^2, \qquad u(x) = (((y + \alpha_1)y + \alpha_2)(x + \alpha_3) + \alpha_4)\alpha_5. \tag{11}$$


The determination of the a’s this time requires the solution of a cubic equation 
(see exercise 19). 

On many computers the number of “storing” operations required by (11) is 
less than 3; for example, we may be able to compute $(x + \alpha_0)^2$ without storing 
$x + \alpha_0$. In fact, most computers nowadays have more than one arithmetic register 
for floating point calculations, so we can avoid storing altogether. Because of 
the wide variety of features available for arithmetic on different computers, we 
shall henceforth in this section count only the arithmetic operations, not the 
operations of storing and loading an accumulator. The computation schemes 
can usually be adapted to any particular computer in a straightforward manner, 
so that very few of these auxiliary operations are necessary; on the other hand, 
it must be remembered that overhead costs might well overshadow the fact that 
we are saving a multiplication or two, especially if the machine code is being 
produced by a compiler that does not optimize. 

A polynomial $u(x) = u_6 x^6 + \cdots + u_1 x + u_0$ of degree six can always be 
evaluated using four multiplications and seven additions, with the scheme 


$$z = (x + \alpha_0)x + \alpha_1, \qquad w = (x + \alpha_2)z + \alpha_3,$$
$$u(x) = ((w + z + \alpha_4)w + \alpha_5)\alpha_6. \tag{12}$$


[See D. E. Knuth, CACM 5 (1962), 595-599.] This saves two of the six 
multiplications required by Horner’s rule. Here again we must solve a cubic equation: 
Since $\alpha_6 = u_6$, we may assume that $u_6 = 1$. Under this assumption, let 


$$\beta_1 = (u_5 - 1)/2, \qquad \beta_2 = u_4 - \beta_1(\beta_1 + 1),$$
$$\beta_3 = u_3 - \beta_1\beta_2, \qquad \beta_4 = \beta_1 - \beta_2, \qquad \beta_5 = u_2 - \beta_1\beta_3.$$


Let $\beta_6$ be a real root of the cubic equation 


$$2y^3 + (2\beta_4 - \beta_2 + 1)y^2 + (2\beta_5 - \beta_2\beta_4 - \beta_3)y + (u_1 - \beta_2\beta_5) = 0. \tag{13}$$


(This equation always has a real root, since the polynomial on the left approaches 
+∞ for large positive y, and it approaches −∞ for large negative y; it must 
assume the value zero somewhere in between.) Now if we define 


$$\beta_7 = \beta_6^2 + \beta_4\beta_6 + \beta_5, \qquad \beta_8 = \beta_3 - \beta_6 - \beta_7,$$


we have finally 


$$\alpha_0 = \beta_2 - 2\beta_6, \qquad \alpha_2 = \beta_1 - \alpha_0, \qquad \alpha_1 = \beta_6 - \alpha_0\alpha_2,$$
$$\alpha_3 = \beta_7 - \alpha_1\alpha_2, \qquad \alpha_4 = \beta_8 - \beta_7 - \alpha_1, \qquad \alpha_5 = u_0 - \beta_7\beta_8. \tag{14}$$

We can illustrate this procedure with a contrived example: Suppose that we 
want to evaluate $x^6 + 13x^5 + 49x^4 + 33x^3 - 61x^2 - 37x + 3$. We obtain $\alpha_6 = 1$, 
$\beta_1 = 6$, $\beta_2 = 7$, $\beta_3 = -9$, $\beta_4 = -1$, $\beta_5 = -7$, and so we meet with the cubic 
equation 


$$2y^3 - 8y^2 + 2y + 12 = 0. \tag{15}$$

This equation has $\beta_6 = 2$ as a root, and we continue to find 

$$\beta_7 = -5, \qquad \beta_8 = -6,$$
$$\alpha_0 = 3, \quad \alpha_2 = 3, \quad \alpha_1 = -7, \quad \alpha_3 = 16, \quad \alpha_4 = 6, \quad \alpha_5 = -27.$$

The resulting scheme is therefore 

$$z = (x + 3)x - 7, \qquad w = (x + 3)z + 16, \qquad u(x) = (w + z + 6)w - 27.$$


By sheer coincidence the quantity x + 3 appears twice here, so we have found a 
method that uses three multiplications and six additions. 
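
The bookkeeping in (13) and (14) is easy to mechanize once a real root of the cubic is in hand. The sketch below assumes a monic polynomial given as `u = [u0, ..., u6]` and takes the root $\beta_6$ as an argument instead of solving the cubic; all three function names are hypothetical.

```python
def _betas(u):
    """beta_1, ..., beta_5 as defined before (13); u = [u0, ..., u6] with u6 == 1."""
    b1 = (u[5] - 1) / 2
    b2 = u[4] - b1 * (b1 + 1)
    b3 = u[3] - b1 * b2
    b4 = b1 - b2
    b5 = u[2] - b1 * b3
    return b1, b2, b3, b4, b5


def cubic_13(u):
    """Coefficients (c3, c2, c1, c0) of the cubic equation (13)."""
    b1, b2, b3, b4, b5 = _betas(u)
    return 2, 2 * b4 - b2 + 1, 2 * b5 - b2 * b4 - b3, u[1] - b2 * b5


def adapt_sextic(u, b6):
    """(alpha_0, ..., alpha_5) from (14), given a real root b6 of (13)."""
    b1, b2, b3, b4, b5 = _betas(u)
    b7 = b6 * b6 + b4 * b6 + b5
    b8 = b3 - b6 - b7
    a0 = b2 - 2 * b6
    a2 = b1 - a0
    a1 = b6 - a0 * a2
    a3 = b7 - a1 * a2
    a4 = b8 - b7 - a1
    a5 = u[0] - b7 * b8
    return a0, a1, a2, a3, a4, a5


def eval_sextic(alpha, x):
    """Scheme (12) with alpha_6 = 1."""
    a0, a1, a2, a3, a4, a5 = alpha
    z = (x + a0) * x + a1
    w = (x + a2) * z + a3
    return (w + z + a4) * w + a5


u = [3, -37, -61, 33, 49, 13, 1]          # the example in the text
print(cubic_13(u))                         # (2, -8.0, 2.0, 12.0), i.e. equation (15)
print(adapt_sextic(u, 2))                  # (3.0, -7.0, 3.0, 16.0, 6.0, -27.0)
```

Equation (15) also has the real roots −1 and 3; either of them may be passed as `b6` and leads to a different, equally valid set of adapted coefficients.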

Another method for handling sixth-degree equations has been suggested by 
V. Y. Pan [Problemy Kibernetiki 5 (1961), 17-29]. His method requires one more 
addition operation, but it involves only rational operations in the preliminary 
steps; no cubic equation needs to be solved. We may proceed as follows: 


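$$z = (x + \alpha_0)x + \alpha_1, \qquad w = z + x + \alpha_2,$$
$$u(x) = (((z - x + \alpha_3)w + \alpha_4)z + \alpha_5)\alpha_6. \tag{16}$$


To determine the α’s, we divide the polynomial once again by $u_6 = \alpha_6$ so that 
u(x) becomes monic. It can then be verified that $\alpha_0 = u_5/3$ and that 


$$\alpha_1 = (u_1 - \alpha_0 u_2 + \alpha_0^2 u_3 - \alpha_0^3 u_4 + 2\alpha_0^5)/(u_3 - 2\alpha_0 u_4 + 5\alpha_0^3). \tag{17}$$


Note that Pan’s method requires that the denominator in (17) does not vanish. 
In other words, (16) can be used only when 


$$27u_3u_6^2 - 18u_6u_5u_4 + 5u_5^3 \ne 0; \tag{18}$$


in fact, this quantity should not be so small that $\alpha_1$ becomes too large. Once $\alpha_1$ 
has been determined, the remaining α’s may be determined from the equations 


$$\beta_1 = 2\alpha_0, \qquad \beta_2 = u_4 - \alpha_0\beta_1 - \alpha_1,$$
$$\beta_3 = u_3 - \alpha_0\beta_2 - \alpha_1\beta_1, \qquad \beta_4 = u_2 - \alpha_0\beta_3 - \alpha_1\beta_2,$$
$$\alpha_3 = \tfrac{1}{2}\bigl(\beta_3 - (\alpha_0 - 1)\beta_2 + (\alpha_0 - 1)(\alpha_0^2 - 1)\bigr) - \alpha_1,$$
$$\alpha_2 = \beta_2 - (\alpha_0^2 - 1) - \alpha_3 - 2\alpha_1, \qquad \alpha_4 = \beta_4 - (\alpha_2 + \alpha_1)(\alpha_3 + \alpha_1),$$
$$\alpha_5 = u_0 - \alpha_1\beta_4. \tag{19}$$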
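
Because Pan's preliminary steps are purely rational, they can be checked exactly with rational arithmetic. The following sketch follows (16), (17) and (19) as reconstructed here; the function names are hypothetical, and the code raises an error when the denominator of (17) vanishes, which is the situation excluded by (18).

```python
from fractions import Fraction


def adapt_pan(u):
    """(alpha_0, ..., alpha_6) for scheme (16); u = [u0, ..., u6] with u6 != 0."""
    a6 = u[6]
    u0, u1, u2, u3, u4, u5 = (c / a6 for c in u[:6])    # make u monic
    a0 = u5 / 3
    denom = u3 - 2 * a0 * u4 + 5 * a0**3
    if denom == 0:
        raise ZeroDivisionError("condition (18) fails; scheme (16) does not apply")
    a1 = (u1 - a0 * u2 + a0**2 * u3 - a0**3 * u4 + 2 * a0**5) / denom
    b1 = 2 * a0
    b2 = u4 - a0 * b1 - a1
    b3 = u3 - a0 * b2 - a1 * b1
    b4 = u2 - a0 * b3 - a1 * b2
    a3 = (b3 - (a0 - 1) * b2 + (a0 - 1) * (a0**2 - 1)) / 2 - a1
    a2 = b2 - (a0**2 - 1) - a3 - 2 * a1
    a4 = b4 - (a2 + a1) * (a3 + a1)
    a5 = u0 - a1 * b4
    return a0, a1, a2, a3, a4, a5, a6


def eval_pan(alpha, x):
    """Scheme (16): four multiplications and eight additions."""
    a0, a1, a2, a3, a4, a5, a6 = alpha
    z = (x + a0) * x + a1
    w = z + x + a2
    return (((z - x + a3) * w + a4) * z + a5) * a6


u = [Fraction(c) for c in (3, -37, -61, 33, 49, 13, 1)]
alpha = adapt_pan(u)
print(eval_pan(alpha, 1) == sum(u))    # True: both equal u(1) = 1
```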


We have discussed the cases of degree n = 4, 5, 6 in detail because the 
smaller values of n arise most frequently in applications. Let us now consider 
a general evaluation scheme for nth degree polynomials, a method that involves 
at most $\lfloor n/2\rfloor + 2$ multiplications and n additions. 

Theorem E. Every nth degree polynomial (1) with real coefficients, n ≥ 3, 
can be evaluated by the scheme 


$$y = x + c, \qquad w = y^2, \qquad z = \begin{cases} (u_n y + \alpha_0)y + \beta_0, & n \text{ even}, \\ u_n y + \beta_0, & n \text{ odd}, \end{cases}$$
$$u(x) = (\ldots((z(w - \alpha_1) + \beta_1)(w - \alpha_2) + \beta_2)\ldots)(w - \alpha_m) + \beta_m, \tag{20}$$


for suitable real parameters c, $\alpha_k$, and $\beta_k$, where $m = \lceil n/2\rceil - 1$. In fact, it is 
possible to select these parameters so that $\beta_m = 0$. 


Proof. Let us first examine the circumstances under which the α’s and β’s can 
be chosen in (20), if c is fixed. Let 


$$p(x) = u(x - c) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0. \tag{21}$$


We want to show that p(x) has the form $p_1(x)(x^2 - \alpha_m) + \beta_m$ for some polynomial 
$p_1(x)$ and some constants $\alpha_m$, $\beta_m$. If we divide p(x) by $x^2 - \alpha_m$, we can see that 
the remainder $\beta_m$ is a constant only if the auxiliary polynomial 


$$q(x) = a_{2m+1} x^m + a_{2m-1} x^{m-1} + \cdots + a_1, \tag{22}$$


formed from every odd-numbered coefficient of p(x), is a multiple of $x - \alpha_m$. 
Conversely, if q(x) has $x - \alpha_m$ as a factor, then $p(x) = p_1(x)(x^2 - \alpha_m) + \beta_m$, for 
some constant $\beta_m$ that may be determined by division. 

Similarly, we want $p_1(x)$ to have the form $p_2(x)(x^2 - \alpha_{m-1}) + \beta_{m-1}$, and 
this is the same as saying that $q(x)/(x - \alpha_m)$ is a multiple of $x - \alpha_{m-1}$; for 
if $q_1(x)$ is the polynomial corresponding to $p_1(x)$ as q(x) corresponds to p(x), 
we have $q_1(x) = q(x)/(x - \alpha_m)$. Continuing in the same way, we find that the 
parameters $\alpha_1$, $\beta_1$, ..., $\alpha_m$, $\beta_m$ will exist if and only if 


$$q(x) = a_{2m+1}(x - \alpha_1)\ldots(x - \alpha_m). \tag{23}$$


In other words, either q(x) is identically zero (and this can happen only when n 
is even), or else q(x) is an mth degree polynomial having all real roots. 

Now we have a surprising fact discovered by J. Eve [Numer. Math. 6 (1964), 
17-21]: If p(x) has at least n − 1 complex roots whose real parts are all nonnegative, 
or all nonpositive, then the corresponding polynomial q(x) is identically 
zero or has all real roots. (See exercise 23.) Since u(x) = 0 if and only if 
p(x + c) = 0, we need merely choose the parameter c large enough that at least 
n − 1 of the roots of u(x) = 0 have a real part ≥ −c, and (20) will apply whenever 
$a_{n-1} = u_{n-1} - ncu_n \ne 0$. 

We can also determine c so that these conditions are fulfilled and also that 
$\beta_m = 0$. First the n roots of u(x) = 0 are determined. If a + bi is a root having 
the largest or the smallest real part, and if b ≠ 0, let c = −a and $\alpha_m = -b^2$; 
then $x^2 - \alpha_m$ is a factor of u(x − c). If the root with smallest or largest real part 
is real, but the root with second smallest (or second largest) real part is nonreal, 
the same transformation applies. If the two roots with smallest (or largest) real 
parts are both real, they can be expressed in the form a − b and a + b, respectively; 
let c = −a and $\alpha_m = b^2$. Again $x^2 - \alpha_m$ is a factor of u(x − c). (Still other values 
of c are often possible; see exercise 24.) The coefficient $a_{n-1}$ will be nonzero for 
at least one of these alternatives, unless q(x) is identically zero. ∎ 


Note that this method of proof usually gives at least two values of c, and we 
also have the chance to permute $\alpha_1$, ..., $\alpha_{m-1}$ in (m − 1)! ways. Some of these 
alternatives may give more desirable numerical accuracy than others. 

Questions of numerical accuracy do not arise, of course, when we are working 
with integers modulo m instead of with real numbers. Scheme (9) works for 
n = 4 when m is relatively prime to $2u_4$, and (16) works for n = 6 when m 
is relatively prime to $6u_6$ and to the denominator of (17). Exercise 44 shows 
that n/2+ O(log n) multiplications and O(n) additions suffice for any monic nth 
degree polynomial modulo any m. 


*Polynomial chains. Now let us consider questions of optimality. What are 
the best possible schemes for evaluating polynomials of various degrees, in terms 
of the minimum possible number of arithmetic operations? This question was 
first analyzed by A. M. Ostrowski in the case that no preliminary adaptation 
of coefficients is allowed [Studies in Mathematics and Mechanics Presented to 
R. von Mises (New York: Academic Press, 1954), 40-48], and by T. S. Motzkin 
in the case of adapted coefficients [see Bull. Amer. Math. Soc. 61 (1955), 163]. 

In order to investigate this question, we can extend Section 4.6.3’s concept 
of addition chains to the notion of polynomial chains. A polynomial chain is a 
sequence of the form 


$$x = \lambda_0, \; \lambda_1, \; \ldots, \; \lambda_r = u(x), \tag{24}$$

where u(x) is some polynomial in x, and for $1 \le i \le r$ 

$$\begin{aligned} \text{either}\quad & \lambda_i = (\pm\lambda_j)\circ\lambda_k, && 0 \le j, k < i, \\ \text{or}\quad & \lambda_i = \alpha_j\circ\lambda_k, && 0 \le k < i. \end{aligned} \tag{25}$$

Here “∘” denotes any of the three operations “+”, “−”, or “×”, and $\alpha_j$ denotes 
a so-called parameter. Steps of the first kind are called chain steps, and steps 
of the second kind are called parameter steps. We shall assume that a different 
parameter $\alpha_j$ is used in each parameter step; if there are s parameter steps, they 
should involve $\alpha_1$, $\alpha_2$, ..., $\alpha_s$ in this order. 

It follows that the polynomial u(x) at the end of the chain has the form 


$$u(x) = q_n x^n + \cdots + q_1 x + q_0, \tag{26}$$

where $q_n$, ..., $q_1$, $q_0$ are polynomials in $\alpha_1$, $\alpha_2$, ..., $\alpha_s$ with integer coefficients. 
We shall interpret the parameters $\alpha_1$, $\alpha_2$, ..., $\alpha_s$ as real numbers, and we shall 
therefore restrict ourselves to considering the evaluation of polynomials with 
real coefficients. The result set R of a polynomial chain is defined to be the 
set of all vectors $(q_n, \ldots, q_1, q_0)$ of real numbers that occur as $\alpha_1$, $\alpha_2$, ..., $\alpha_s$ 
independently assume all possible real values. 

If for every choice of t + 1 distinct integers $j_0, \ldots, j_t \in \{0, 1, \ldots, n\}$ there 
is a nonzero multivariate polynomial $f_{j_0 \ldots j_t}$ with integer coefficients such that 
$f_{j_0 \ldots j_t}(q_{j_0}, \ldots, q_{j_t}) = 0$ for all $(q_n, \ldots, q_1, q_0)$ in R, let us say that the result 
set R has at most t degrees of freedom, and that the chain (24) has at most t 
degrees of freedom. We also say that the chain (24) computes a given polynomial 
$u(x) = u_n x^n + \cdots + u_1 x + u_0$ if $(u_n, \ldots, u_1, u_0)$ is in R. It follows that a 
polynomial chain with at most n degrees of freedom cannot compute all nth 
degree polynomials (see exercise 27). 

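As an example of a polynomial chain, consider the following chain 
corresponding to Theorem E, when n is odd: 


$$\begin{aligned} \lambda_0 &= x \\ \lambda_1 &= \alpha_1 + \lambda_0 \\ \lambda_2 &= \lambda_1 \times \lambda_1 \\ \lambda_3 &= \alpha_2 \times \lambda_1 \\ \lambda_{1+3i} &= \alpha_{1+2i} + \lambda_{3i} \\ \lambda_{2+3i} &= \alpha_{2+2i} + \lambda_2 \qquad\qquad 1 \le i < n/2. \\ \lambda_{3+3i} &= \lambda_{1+3i} \times \lambda_{2+3i} \end{aligned} \tag{27}$$


There are $\lfloor n/2\rfloor + 2$ multiplications and n additions; $\lfloor n/2\rfloor + 1$ chain steps and 
n + 1 parameter steps. By Theorem E, the result set R includes the set of all 
$(u_n, \ldots, u_1, u_0)$ with $u_n \ne 0$, so (27) computes all polynomials of degree n. We 
cannot prove that R has at most n degrees of freedom, since the result set has 
n + 1 independent components. 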
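
Chain (27) can be traced mechanically. The sketch below (function name and return convention are assumptions) takes the n + 1 parameters $\alpha_1, \ldots, \alpha_{n+1}$ for odd n ≥ 3, builds the $\lambda$'s in order, and counts the operations so that the totals just stated can be confirmed. Note that the parameters here are those of (27); in terms of scheme (20), $\alpha_1$ plays the role of c and $\alpha_{2+2i}$ the role of $-\alpha_i$.

```python
def run_chain_27(alpha, x):
    """Run chain (27). alpha = [alpha_1, ..., alpha_{n+1}] with n odd, n >= 3.

    Returns (final lambda, multiplications, additions).
    """
    n = len(alpha) - 1
    a = [None] + list(alpha)              # shift to 1-based indexing
    lam = [x]                             # lambda_0 = x
    lam.append(a[1] + lam[0])             # lambda_1 (parameter step)
    lam.append(lam[1] * lam[1])           # lambda_2 (chain step)
    lam.append(a[2] * lam[1])             # lambda_3 (parameter step)
    mults, adds = 2, 1
    for i in range(1, (n - 1) // 2 + 1):  # 1 <= i < n/2
        lam.append(a[1 + 2 * i] + lam[3 * i])         # lambda_{1+3i}
        lam.append(a[2 + 2 * i] + lam[2])             # lambda_{2+3i}
        lam.append(lam[1 + 3 * i] * lam[2 + 3 * i])   # lambda_{3+3i}
        mults += 1
        adds += 2
    return lam[-1], mults, adds


# n = 3: u(x) = (2(x + 1) + 3)((x + 1)^2 + 4), evaluated at x = 1
print(run_chain_27([1, 2, 3, 4], 1))    # (56, 3, 3)
```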

A polynomial chain with s parameter steps has at most s degrees of freedom. 
In a sense, this is obvious: We can’t compute a function with t degrees of freedom 
using fewer than t arbitrary parameters. But this intuitive fact is not easy 
to prove formally; for example, there are continuous functions (“space-filling 
curves”) that map the real line onto a plane, and such functions map a single 
parameter into two independent parameters. For our purposes, we need to verify 
that no polynomial functions with integer coefficients can have such a property; 
a proof appears in exercise 28. 

Given this fact, we can proceed to prove the results we seek: 


Theorem M (T. S. Motzkin, 1954). A polynomial chain with m > 0 multiplications 
has at most 2m degrees of freedom. 


Proof. Let $\mu_1$, $\mu_2$, ..., $\mu_m$ be the $\lambda_i$’s of the chain that are multiplication 
operations. Then 


$$\mu_i = S_{2i-1} \times S_{2i} \quad\text{for } 1 \le i \le m \qquad\text{and}\qquad u(x) = S_{2m+1}, \tag{28}$$


where each $S_j$ is a certain sum of μ’s, x’s, and α’s. Write $S_j = T_j + \beta_j$, where 
$T_j$ is a sum of μ’s and x’s while $\beta_j$ is a sum of α’s. 

Now u(x) is expressible as a polynomial in x, $\beta_1$, ..., $\beta_{2m+1}$ with integer 
coefficients. Since the β’s are expressible as linear functions of $\alpha_1$, ..., $\alpha_s$, the 
set of values represented by all real values of $\beta_1$, ..., $\beta_{2m+1}$ contains the result 
set of the chain. Therefore there are at most 2m + 1 degrees of freedom; this can 
be improved to 2m when m > 0, as shown in exercise 30. ∎ 

An example of the construction in the proof of Theorem M appears in 
exercise 25. A similar result can be proved for additions: 


Theorem A (E. G. Belaga, 1958). A polynomial chain containing q additions 
and subtractions has at most q + 1 degrees of freedom. 


Proof. [Problemy Kibernetiki 5 (1961), 7-15.] Let $\kappa_1$, ..., $\kappa_q$ be the $\lambda_i$’s of the 
chain that correspond to addition or subtraction operations. Then 


$$\kappa_i = \pm T_{2i-1} \pm T_{2i} \quad\text{for } 1 \le i \le q \qquad\text{and}\qquad u(x) = T_{2q+1}, \tag{29}$$


where each $T_j$ is a product of κ’s, x’s, and α’s. We may write $T_j = A_j B_j$, 
where $A_j$ is a product of α’s and $B_j$ is a product of κ’s and x’s. The following 
transformation may now be made to the chain, successively for i = 1, 2, ..., q: 
Let $\beta_i = A_{2i}/A_{2i-1}$, so that $\kappa_i = A_{2i-1}(\pm B_{2i-1} \pm \beta_i B_{2i})$. Then change $\kappa_i$ 
to $\pm B_{2i-1} \pm \beta_i B_{2i}$, and replace each occurrence of $\kappa_i$ in future formulas $T_{2i+1}$, 
$T_{2i+2}$, ..., $T_{2q+1}$ by $A_{2i-1}\kappa_i$. (This replacement may change the values of $A_{2i+1}$, 
$A_{2i+2}$, ..., $A_{2q+1}$.) 

After the transformation has been done for all i, let $\beta_{q+1} = A_{2q+1}$; then u(x) 
can be expressed as a polynomial in $\beta_1$, ..., $\beta_{q+1}$, and x, with integer coefficients. 
We are almost ready to complete the proof, but we must be careful because the 
polynomials obtained as $\beta_1$, ..., $\beta_{q+1}$ range over all real values may not include 
all polynomials representable by the original chain (see exercise 26); it is possible 
to have $A_{2i-1} = 0$, for some values of the α’s, and this makes $\beta_i$ undefined. 

To complete the proof, let us observe that the result set R of the original 
chain can be written $R = R_1 \cup R_2 \cup \cdots \cup R_q \cup R'$, where $R_i$ is the set of result 
vectors possible when $A_{2i-1} = 0$, and where R′ is the set of result vectors possible 
when all α’s are nonzero. The discussion above proves that R′ has at most q + 1 
degrees of freedom. If $A_{2i-1} = 0$, then $T_{2i-1} = 0$, so addition step $\kappa_i$ may be 
dropped to obtain another chain computing the result set $R_i$; by induction we 
see that each $R_i$ has at most q degrees of freedom. Hence by exercise 29, R has 
at most q + 1 degrees of freedom. ∎ 


Theorem C. If a polynomial chain (24) computes all $n$th degree polynomials
$u(x) = u_n x^n + \cdots + u_0$, for some $n \ge 2$, then it includes at least $\lfloor n/2\rfloor + 1$
multiplications and at least $n$ addition-subtractions.


Proof. Let there be $m$ multiplication steps. By Theorem M, the chain has at
most $2m$ degrees of freedom, so $2m \ge n+1$. Similarly, by Theorem A there are
$\ge n$ addition-subtractions. ∎


This theorem states that no single method having fewer than $\lfloor n/2\rfloor + 1$
multiplications or fewer than $n$ additions can evaluate all possible $n$th degree
polynomials. The result of exercise 29 allows us to strengthen this and say that no
finite collection of such polynomial chains will suffice for all polynomials of a given
degree. Some special polynomials can, of course, be evaluated more efficiently;
all we have really proved is that polynomials whose coefficients are algebraically
independent, in the sense that they satisfy no nontrivial polynomial equation,
require $\lfloor n/2\rfloor + 1$ multiplications and $n$ additions. Unfortunately the coefficients
we deal with in computers are always rational numbers, so the theorems
above don’t really apply; in fact, exercise 42 shows that we can always get by
with $O(\sqrt{n}\,)$ multiplications (and a possibly huge number of additions). From a
practical standpoint, the bounds of Theorem C apply to “almost all” coefficients,
and they seem to apply to all reasonable schemes for evaluation. Furthermore
it is possible to obtain lower bounds corresponding to those of Theorem C even
in the rational case: By strengthening the proofs above, V. Strassen has shown,
for example, that the polynomial

$$u(x) = \sum_{k=0}^{n} 2^{2^{kn^3}} x^k \eqno(30)$$

cannot be evaluated by any polynomial chain of length $< n^2/\lg n$ unless the
chain has at least $\frac12 n - 2$ multiplications and $n - 4$ additions [SICOMP 3 (1974),
128-149]. The coefficients of (30) are very large; but it is also possible to find
polynomials whose coefficients are just 0s and 1s, such that every polynomial
chain computing them involves at least $\sqrt{n}/(4\lg n)$ chain multiplications, for all
sufficiently large $n$, even when the parameters $\alpha_j$ are allowed to be arbitrary
complex numbers. [See R. J. Lipton, SICOMP 7 (1978), 61-69; C.-P. Schnorr,
Lecture Notes in Comp. Sci. 53 (1977), 135-147.] Jean-Paul van de Wiele has
shown that the evaluation of certain 0-1 polynomials requires a total of at least
$cn/\log n$ arithmetic operations, for some $c > 0$ [FOCS 19 (1978), 159-165].


A gap still remains between the lower bounds of Theorem C and the actual
operation counts known to be achievable, except in the trivial case $n = 2$.
Theorem E gives $\lfloor n/2\rfloor + 2$ multiplications, not $\lfloor n/2\rfloor + 1$, although it does
achieve the minimum number of additions. Our special methods for $n = 4$ and
$n = 6$ have the minimum number of multiplications, but one extra addition.
When $n$ is odd, it is not difficult to prove that the lower bounds of Theorem C
cannot be achieved simultaneously for both multiplications and additions; see
exercise 33. For $n = 3$, 5, and 7, it is possible to show that at least $\lfloor n/2\rfloor + 2$
multiplications are necessary. Exercises 35 and 36 show that the lower bounds
of Theorem C cannot both be achieved when $n = 4$ or $n = 6$; thus the methods
we have discussed are best possible, for $n < 8$. When $n$ is even, Motzkin proved
that $\lfloor n/2\rfloor + 1$ multiplications are sufficient, but his construction involves an
indeterminate number of additions (see exercise 39). An optimal scheme for
$n = 8$ was found by V. Y. Pan, who showed that $n + 1$ additions are necessary
and sufficient for this case when there are $\lfloor n/2\rfloor + 1$ multiplications; he also
showed that $\lfloor n/2\rfloor + 1$ multiplications and $n + 2$ additions will suffice for all
even $n \ge 10$. Pan’s paper [STOC 10 (1978), 162-172] also establishes the exact
minimum number of multiplications and additions needed when calculations are
done entirely with complex numbers instead of reals, for all degrees $n$. Exercise 40
discusses the interesting situation that arises for odd values of $n \ge 9$.


It is clear that the results we have obtained about chains for polynomials in
a single variable can be extended without difficulty to multivariate polynomials.
For example, if we want to find an optimum scheme for polynomial evaluation
without adaptation of coefficients, we can regard $u(x)$ as a polynomial in the
$n + 2$ variables $x$, $u_n$, ..., $u_1$, $u_0$; exercise 38 shows that $n$ multiplications and
$n$ additions are necessary in this case. Indeed, A. Borodin [Theory of Machines
and Computations, edited by Z. Kohavi and A. Paz (New York: Academic Press,
1971), 45-58] has proved that Horner’s rule (2) is essentially the only way to
compute $u(x)$ in $2n$ operations without preconditioning.


With minor variations, the methods above can be extended to chains involv- 
ing division, that is, to rational functions as well as polynomials. Curiously, the 
continued-fraction analog of Horner’s rule now turns out to be optimal from an 
operation-count standpoint, if multiplication and division speeds are equal, even 
when preconditioning is allowed (see exercise 37). 

Sometimes division is helpful during the evaluation of polynomials, even 
though polynomials are defined only in terms of multiplication and addition; 
we have seen examples of this in the Shaw—Traub algorithms for polynomial 
derivatives. Another example is the polynomial

$$x^n + \cdots + x + 1;$$

since this polynomial can be written $(x^{n+1} - 1)/(x - 1)$, we can evaluate it with
$l(n + 1)$ multiplications (see Section 4.6.3), two subtractions, and one division,
while techniques that avoid division seem to require about three times as many 
operations (see exercise 43). 


Special multivariate polynomials. The determinant of an $n \times n$ matrix may
be considered to be a polynomial in $n^2$ variables $x_{ij}$, $1 \le i, j \le n$. If $x_{11} \ne 0$, we
have

$$\det\begin{pmatrix}
x_{11} & x_{12} & \ldots & x_{1n}\\
x_{21} & x_{22} & \ldots & x_{2n}\\
x_{31} & x_{32} & \ldots & x_{3n}\\
\vdots & \vdots & & \vdots\\
x_{n1} & x_{n2} & \ldots & x_{nn}
\end{pmatrix}
= x_{11}\det\begin{pmatrix}
x_{22} - (x_{21}/x_{11})x_{12} & \ldots & x_{2n} - (x_{21}/x_{11})x_{1n}\\
x_{32} - (x_{31}/x_{11})x_{12} & \ldots & x_{3n} - (x_{31}/x_{11})x_{1n}\\
\vdots & & \vdots\\
x_{n2} - (x_{n1}/x_{11})x_{12} & \ldots & x_{nn} - (x_{n1}/x_{11})x_{1n}
\end{pmatrix}. \eqno(31)$$

The determinant of an $n \times n$ matrix may therefore be evaluated by evaluating
the determinant of an $(n - 1) \times (n - 1)$ matrix and performing an additional
$(n - 1)^2 + 1$ multiplications, $(n - 1)^2$ additions, and $n - 1$ divisions. Since a
$2 \times 2$ determinant can be evaluated with two multiplications and one addition,
we see that the determinant of almost all matrices (namely those for which no
division by zero is needed) can be computed with at most $(2n^3 - 3n^2 + 7n - 6)/6$
multiplications, $(2n^3 - 3n^2 + n)/6$ additions, and $(n^2 - n - 2)/2$ divisions.

When zero occurs, the determinant is even easier to compute. For example,
if $x_{11} = 0$ but $x_{21} \ne 0$, we have

$$\det\begin{pmatrix}
0 & x_{12} & \ldots & x_{1n}\\
x_{21} & x_{22} & \ldots & x_{2n}\\
x_{31} & x_{32} & \ldots & x_{3n}\\
\vdots & \vdots & & \vdots\\
x_{n1} & x_{n2} & \ldots & x_{nn}
\end{pmatrix}
= -x_{21}\det\begin{pmatrix}
x_{12} & \ldots & x_{1n}\\
x_{32} - (x_{31}/x_{21})x_{22} & \ldots & x_{3n} - (x_{31}/x_{21})x_{2n}\\
\vdots & & \vdots\\
x_{n2} - (x_{n1}/x_{21})x_{22} & \ldots & x_{nn} - (x_{n1}/x_{21})x_{2n}
\end{pmatrix}. \eqno(32)$$

Here the reduction to an $(n - 1) \times (n - 1)$ determinant saves $n - 1$ of the
multiplications and $n - 1$ of the additions used in (31), in compensation for the
additional bookkeeping required to recognize this case. Thus any determinant can
be evaluated with roughly $\frac23 n^3$ arithmetic operations (including division); this is
remarkable, since it is a polynomial with $n!$ terms and $n$ variables in each term.

If we want to evaluate the determinant of a matrix with integer elements, the 
procedure of (31) and (32) appears to be unattractive since it requires rational 
arithmetic. However, we can use the method to evaluate the determinant mod p, 
for any prime p, since division mod p is possible (exercise 4.5.2-16). If this is 
done for sufficiently many primes, the exact value of the determinant can be 
found as explained in Section 4.3.2, since Hadamard’s inequality 4.6.1-(25) gives
an upper bound on the magnitude. 
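
The elimination of (31) and (32), carried out mod $p$, can be sketched as follows. This is a minimal illustration rather than a tuned routine: the function name `det_mod_p` is hypothetical, $p$ is assumed to be prime, and division mod $p$ is done with Python's built-in `pow(a, -1, p)` (available from Python 3.8). The row interchange handles the zero-pivot case of (32) and accounts for the sign change.

```python
def det_mod_p(matrix, p):
    """Determinant of a square integer matrix modulo a prime p, by elimination."""
    a = [[x % p for x in row] for row in matrix]
    n = len(a)
    det = 1
    for col in range(n):
        # Look for a nonzero pivot in this column, as in the case analysis of (32).
        pivot = next((r for r in range(col, n) if a[r][col]), None)
        if pivot is None:
            return 0                      # whole column is zero mod p
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                    # a row interchange negates the determinant
        det = det * a[col][col] % p
        inv = pow(a[col][col], -1, p)     # division mod p
        for r in range(col + 1, n):
            factor = a[r][col] * inv % p
            for c in range(col, n):
                a[r][c] = (a[r][c] - factor * a[col][c]) % p
    return det % p


# Usage: the integer determinant of this matrix is 50.
example = [[0, 2, 1], [3, 1, 4], [5, 9, 2]]
print(det_mod_p(example, 101), det_mod_p(example, 7))
```

Repeating the computation for several primes and combining the residues, as the text describes, is left out of this sketch.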

The coefficients of the characteristic polynomial $\det(xI - X)$ of an $n \times n$
matrix $X$ can also be computed in $O(n^3)$ steps; see J. H. Wilkinson, The Algebraic
Eigenvalue Problem (Oxford: Clarendon Press, 1965), 353-355, 410-411. Exercise 70
discusses an interesting division-free method that involves $O(n^4)$ steps.

The permanent of a matrix is a polynomial that is very similar to the 
determinant; the only difference is that all of its nonzero coefficients are +1. 
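Thus we have

$$\operatorname{per}\begin{pmatrix}
x_{11} & \ldots & x_{1n}\\
\vdots & & \vdots\\
x_{n1} & \ldots & x_{nn}
\end{pmatrix} = \sum x_{1j_1} x_{2j_2} \ldots x_{nj_n} \eqno(33)$$

summed over all permutations $j_1 j_2 \ldots j_n$ of $\{1,2,\ldots,n\}$. It would seem that
this function should be even easier to compute than its more complicated-looking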
cousin, but no way to evaluate the permanent as efficiently as the determinant 
is known. Exercises 9 and 10 show that substantially fewer than n! operations 
will suffice, for large n, but the execution time of all known methods still grows 
exponentially with the size of the matrix. In fact, Leslie G. Valiant has shown 
that it is as difficult to compute the permanent of a given 0-1 matrix as it is to 
count the number of accepting computations of a nondeterministic polynomial- 
time Turing machine, if we ignore polynomial factors in the running time of the 
calculation. Therefore a polynomial-time evaluation algorithm for permanents 
would imply that scores of other well known problems that have resisted efficient 
solution would be solvable in polynomial time. On the other hand, Valiant proved
that the permanent of an $n \times n$ integer matrix can be evaluated modulo $2^k$ in
$O(n^{4k-3})$ steps for all $k \ge 2$. [See Theoretical Comp. Sci. 8 (1979), 189-201.]
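
To make definition (33) concrete, here is a small sketch that evaluates the permanent in two ways: directly from the definition, and by Ryser's inclusion-exclusion formula, a standard identity that needs on the order of $2^n n^2$ operations instead of about $n!\,n$. The choice of Ryser's formula is an assumption made for illustration; this section does not say which methods exercises 9 and 10 use. Both function names are hypothetical.

```python
from itertools import combinations, permutations


def permanent_by_definition(a):
    """Sum of x[0][j0] * x[1][j1] * ... over all permutations, as in (33)."""
    n = len(a)
    total = 0
    for perm in permutations(range(n)):
        term = 1
        for i, j in enumerate(perm):
            term *= a[i][j]
        total += term
    return total


def permanent_ryser(a):
    """Ryser's formula: sum over column subsets S of
    (-1)^(n-|S|) * product over rows of (row sum restricted to S)."""
    n = len(a)
    total = 0
    for size in range(1, n + 1):
        for cols in combinations(range(n), size):
            prod = 1
            for row in a:
                prod *= sum(row[j] for j in cols)
            total += (-1) ** (n - size) * prod
    return total


m = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
print(permanent_by_definition(m), permanent_ryser(m))
```

Both routines still take exponential time, in keeping with the remark that all known methods grow exponentially with the size of the matrix.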


Another fundamental operation involving matrices is, of course, matrix
multiplication: If $X = (x_{ij})$ is an $m \times n$ matrix, $Y = (y_{jk})$ is an $n \times s$ matrix, and
$Z = (z_{ik})$ is an $m \times s$ matrix, then the formula $Z = XY$ means that

$$z_{ik} = \sum_{j=1}^{n} x_{ij}\, y_{jk}, \qquad 1 \le i \le m, \quad 1 \le k \le s. \eqno(34)$$

This equation may be regarded as the computation of $ms$ simultaneous polynomials
in $mn + ns$ variables; each polynomial is the “inner product” of two $n$-place
vectors. A straightforward calculation would involve $mns$ multiplications and
$ms(n - 1)$ additions; but S. Winograd discovered in 1967 that there is a way to
trade about half of the multiplications for additions:

$$z_{ik} = \sum_{1\le j\le n/2} (x_{i,2j} + y_{2j-1,k})(x_{i,2j-1} + y_{2j,k}) - a_i - b_k + x_{in}\,y_{nk}\,[n \text{ odd}];$$

$$a_i = \sum_{1\le j\le n/2} x_{i,2j}\, x_{i,2j-1}; \qquad b_k = \sum_{1\le j\le n/2} y_{2j-1,k}\, y_{2j,k}. \eqno(35)$$

This scheme uses $\lceil n/2\rceil ms + \lfloor n/2\rfloor(m + s)$ multiplications and $(n + 2)ms +
(\lfloor n/2\rfloor - 1)(ms + m + s)$ additions or subtractions; the total number of
operations has increased slightly, but the number of multiplications has roughly
been halved. [See IEEE Trans. C-17 (1968), 693-694.] Winograd’s surprising
construction led many people to look more closely at the problem of matrix
multiplication, and it touched off widespread speculation that $n^3/2$ multiplications
might be necessary to multiply $n \times n$ matrices, because of the somewhat similar
lower bound that was known to hold for polynomials in one variable.
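
Scheme (35) can be sketched in a few lines. The code below is an illustration under stated assumptions: the name `winograd_matmul` is hypothetical, matrices are lists of rows, and indices are 0-based, so the book's pair $(2j-1, 2j)$ becomes `(2*j, 2*j + 1)`. The scheme relies on commutativity of the entries; no scaling for numerical stability is attempted.

```python
def winograd_matmul(X, Y):
    """Multiply an m x n matrix X by an n x s matrix Y using scheme (35)."""
    m, n, s = len(X), len(Y), len(Y[0])
    half = n // 2
    # Precomputed corrections a_i and b_k: floor(n/2) multiplications each.
    a = [sum(X[i][2 * j + 1] * X[i][2 * j] for j in range(half)) for i in range(m)]
    b = [sum(Y[2 * j][k] * Y[2 * j + 1][k] for j in range(half)) for k in range(s)]
    Z = [[0] * s for _ in range(m)]
    for i in range(m):
        for k in range(s):
            # floor(n/2) multiplications instead of n for this inner product.
            z = sum((X[i][2 * j + 1] + Y[2 * j][k]) * (X[i][2 * j] + Y[2 * j + 1][k])
                    for j in range(half))
            z -= a[i] + b[k]
            if n % 2:                      # the extra term x_in * y_nk when n is odd
                z += X[i][n - 1] * Y[n - 1][k]
            Z[i][k] = z
    return Z


print(winograd_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]]))
```

Expanding each product shows why it works: the cross terms give $x_{i,2j}y_{2j,k} + x_{i,2j-1}y_{2j-1,k}$, and the unwanted terms are exactly those collected in $a_i$ and $b_k$.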

An even better scheme for large $n$ was discovered by Volker Strassen in
1968; he found a way to compute the product of $2 \times 2$ matrices with only
seven multiplications, without relying on the commutativity of multiplication
as in (35). Since $2n \times 2n$ matrices can be partitioned into four $n \times n$ matrices,
his idea can be used recursively to obtain the product of $2^k \times 2^k$ matrices with
only $7^k$ multiplications instead of $(2^k)^3 = 8^k$. The number of additions also
grows as order $7^k$. Strassen’s original $2 \times 2$ identity [Numer. Math. 13 (1969),
354-356] used 7 multiplications and 18 additions; S. Winograd later discovered
the following more economical formula:

$$\begin{pmatrix} a & b\\ c & d\end{pmatrix}\begin{pmatrix} A & C\\ B & D\end{pmatrix}
= \begin{pmatrix} aA + bB & w + v + (a + b - c - d)D\\ w + u + d(B + C - A - D) & w + u + v\end{pmatrix}, \eqno(36)$$

where $u = (c - a)(C - D)$, $v = (c + d)(C - A)$, $w = aA + (c + d - a)(A + D - C)$.
If intermediate results are appropriately saved, this involves 7 multiplications
and only 15 additions; by induction on $k$, we can multiply $2^k \times 2^k$ matrices with
$7^k$ multiplications and $5(7^k - 4^k)$ additions. The total number of operations
needed to multiply $n \times n$ matrices has therefore been reduced from order $n^3$
to $O(n^{\lg 7}) = O(n^{2.8074})$. A similar reduction applies also to the evaluation of
determinants and matrix inverses; see J. R. Bunch and J. E. Hopcroft, Math.
Comp. 28 (1974), 231-236.
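
The recursive use of the seven-multiplication identity can be sketched as follows, for $2^k \times 2^k$ matrices stored as lists of rows. All names (`strassen_winograd`, `MULTS`, and the helpers) are hypothetical. Each product keeps its left factor from the first matrix and its right factor from the second, so commutativity is never used; that is what lets the blocks themselves be matrices. For brevity the sketch does not save intermediate sums, so it performs more than 15 block additions per level.

```python
MULTS = [0]  # counts scalar multiplications, to illustrate the 7^k growth


def add(P, Q):
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(P, Q)]


def sub(P, Q):
    return [[p - q for p, q in zip(r1, r2)] for r1, r2 in zip(P, Q)]


def split(M):
    """Return the four quadrants (top-left, top-right, bottom-left, bottom-right)."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen_winograd(X, Y):
    """Product of two 2^k x 2^k matrices with 7^k scalar multiplications."""
    if len(X) == 1:
        MULTS[0] += 1
        return [[X[0][0] * Y[0][0]]]
    a, b, c, d = split(X)          # X = (a b; c d)
    A, C, B, D = split(Y)          # Y = (A C; B D): top-right quadrant is called C
    mul = strassen_winograd
    aA = mul(a, A)
    bB = mul(b, B)
    u = mul(sub(c, a), sub(C, D))
    v = mul(add(c, d), sub(C, A))
    w = add(aA, mul(sub(add(c, d), a), sub(add(A, D), C)))
    z11 = add(aA, bB)
    z12 = add(add(w, v), mul(sub(sub(add(a, b), c), d), D))
    z21 = add(add(w, u), mul(d, sub(sub(add(B, C), A), D)))
    z22 = add(add(w, u), v)
    top = [r1 + r2 for r1, r2 in zip(z11, z12)]
    bottom = [r1 + r2 for r1, r2 in zip(z21, z22)]
    return top + bottom


X = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
Y = [[2, 0, 1, 3], [1, 1, 0, 2], [4, 5, 6, 7], [0, 3, 2, 1]]
print(strassen_winograd(X, Y), MULTS[0])
```

For the $4 \times 4$ example the counter ends at $7^2 = 49$, compared with $4^3 = 64$ multiplications for the classical method.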

Strassen’s exponent $\lg 7$ resisted numerous attempts at improvement until
1978, when Viktor Pan discovered that it could be lowered to $\log_{70} 143640 \approx
2.795$ (see exercise 60). This new breakthrough led to further intensive analysis of
the problem, and the combined efforts of D. Bini, M. Capovani, D. Coppersmith,
G. Lotti, F. Romani, A. Schönhage, V. Pan, and S. Winograd produced a
dramatic reduction in the asymptotic running time. Exercises 60-67 discuss
some of the interesting techniques by which such upper bounds have been
established; in particular, exercise 66 contains a reasonably simple proof that $O(n^{2.55})$
operations suffice. The best upper bound known as of 1997 is $O(n^{2.376})$, due to
Coppersmith and Winograd [J. Symbolic Comp. 9 (1990), 251-280]. By contrast,
the best current lower bound is $2n^2 - 1$ (see exercise 12).

These theoretical results are quite striking, but from a practical standpoint 
they are of little use because n must be very large before we overcome the effect of 
additional bookkeeping costs. Richard Brent [Stanford Computer Science report 
CS157 (March 1970), see also Numer. Math. 16 (1970), 145-156] found that 
a careful implementation of Winograd’s scheme (35), with appropriate scaling 
for numerical stability, became better than the conventional method only when 
n > 40, and it saved only about 7 percent of the running time when n = 100. For 
complex arithmetic the situation was somewhat different; scheme (35) became 
advantageous for n > 20, and saved 18 percent when n = 100. He estimated 
that Strassen’s scheme (36) would not begin to excel over (35) until $n \approx 250$;
and such enormous matrices rarely occur in practice unless they are very sparse,
when other techniques apply. Furthermore, the known methods of order $n^\omega$
where $\omega < 2.7$ have such large constants of proportionality that they require
more than $10^{23}$ multiplications before they start to beat (36).

By contrast, the methods we shall discuss next are eminently practical and
have found wide use. The discrete Fourier transform $f$ of a complex-valued
function $F$ of $n$ variables, over respective domains of $m_1$, ..., $m_n$ elements, is
defined by the equation

$$f(s_1,\ldots,s_n) = \sum_{\substack{0\le t_1<m_1\\ \cdots\\ 0\le t_n<m_n}} \exp\Bigl(2\pi i\Bigl(\frac{s_1 t_1}{m_1} + \cdots + \frac{s_n t_n}{m_n}\Bigr)\Bigr) F(t_1,\ldots,t_n) \eqno(37)$$

for $0 \le s_1 < m_1$, ..., $0 \le s_n < m_n$; the name “transform” is justified because
we can recover the values $F(t_1,\ldots,t_n)$ from the values $f(s_1,\ldots,s_n)$, as shown
in exercise 13. In the important special case that all $m_j = 2$, we have

$$f(s_1,\ldots,s_n) = \sum_{0\le t_1,\ldots,t_n\le 1} (-1)^{s_1 t_1 + \cdots + s_n t_n} F(t_1,\ldots,t_n) \eqno(38)$$

for $0 \le s_1,\ldots,s_n \le 1$, and this may be regarded as a simultaneous evaluation of
$2^n$ linear polynomials in $2^n$ variables $F(t_1,\ldots,t_n)$. A well-known technique due
to F. Yates [The Design and Analysis of Factorial Experiments (Harpenden:
Imperial Bureau of Soil Sciences, 1937)] can be used to reduce the number
of additions implied in (38) from $2^n(2^n - 1)$ to $n2^n$. Yates’s method can be
understood by considering the case $n = 3$: Let $X_{t_1 t_2 t_3} = F(t_1, t_2, t_3)$.

Given  First step  Second step          Third step

X000   X000+X001   X000+X001+X010+X011  X000+X001+X010+X011+X100+X101+X110+X111
X001   X010+X011   X100+X101+X110+X111  X000-X001+X010-X011+X100-X101+X110-X111
X010   X100+X101   X000-X001+X010-X011  X000+X001-X010-X011+X100+X101-X110-X111
X011   X110+X111   X100-X101+X110-X111  X000-X001-X010+X011+X100-X101-X110+X111
X100   X000-X001   X000+X001-X010-X011  X000+X001+X010+X011-X100-X101-X110-X111
X101   X010-X011   X100+X101-X110-X111  X000-X001+X010-X011-X100+X101-X110+X111
X110   X100-X101   X000-X001-X010+X011  X000+X001-X010-X011-X100-X101+X110+X111
X111   X110-X111   X100-X101-X110+X111  X000-X001-X010+X011-X100+X101+X110-X111

To get from the “Given” to the “First step” requires four additions and four
subtractions; and the interesting feature of Yates’s method is that exactly the
same transformation that takes us from “Given” to “First step” will take us
from “First step” to “Second step” and from “Second step” to “Third step.” In
each case we do four additions, then four subtractions; and after three steps we
magically have the desired Fourier transform $f(s_1, s_2, s_3)$ in the place originally
occupied by $F(s_1, s_2, s_3)$.
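
The repeated step of Yates's method is short enough to write out. In this sketch (function names hypothetical) the data are held in a list of length $2^n$, with the element $X_{t_1 t_2 \ldots t_n}$ at the position whose binary representation is $t_1 t_2 \ldots t_n$. One step forms the sums of adjacent pairs followed by their differences, exactly as in the passage from “Given” to “First step”; applying it $n$ times yields the transform (38).

```python
def yates_step(x):
    """One pass: sums of adjacent pairs, then differences of adjacent pairs."""
    half = len(x) // 2
    sums = [x[2 * i] + x[2 * i + 1] for i in range(half)]
    diffs = [x[2 * i] - x[2 * i + 1] for i in range(half)]
    return sums + diffs


def hadamard_transform(x):
    """Apply the same step n times to 2^n data elements (n * 2^n additions)."""
    n = len(x).bit_length() - 1
    if len(x) != 1 << n:
        raise ValueError("length must be a power of 2")
    for _ in range(n):
        x = yates_step(x)
    return x


data = [3, 1, 4, 1, 5, 9, 2, 6]
print(hadamard_transform(data))
```

The result can be checked against (38) directly: entry $s$ should equal the sum of $(-1)^{s\cdot t} x_t$, where $s \cdot t$ counts the bit positions in which both $s$ and $t$ have a 1.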

This special case is often called the Hadamard transform or the Walsh
transform of $2^n$ data elements, since the corresponding pattern of signs was
studied by J. Hadamard [Bull. Sci. Math. (2) 17 (1893), 240-246] and by J. L.
Walsh [Amer. J. Math. 45 (1923), 5-24]. Notice that the number of sign changes
from left to right in the “Third step” assumes the respective values


0, 7, 3, 4, 1, 6, 2, 5;


this is a permutation of the numbers {0,1,2,3,4,5,6,7}. Walsh observed that
there will be exactly 0, 1, ..., $2^n - 1$ sign changes in the general case, if
we permute the transformed elements appropriately, so the coefficients provide
discrete approximations to sine waves with various frequencies. (See Section
7.2.1.1 for further discussion of the Hadamard-Walsh coefficients.)

Yates’s method can be generalized to the evaluation of any discrete Fourier
transform, and, in fact, to the evaluation of any set of sums that can be written
in the general form

$$f(s_1, s_2, \ldots, s_n) = \sum_{\substack{0\le t_1<m_1\\ \cdots\\ 0\le t_n<m_n}} g_1(s_1, s_2, \ldots, s_n, t_1)\, g_2(s_2, \ldots, s_n, t_2) \ldots g_n(s_n, t_n)\, F(t_1, t_2, \ldots, t_n) \eqno(39)$$

for $0 \le s_j < m_j$, given the functions $g_j(s_j, \ldots, s_n, t_j)$. We proceed as follows:

$$\begin{aligned}
f_0(t_1, t_2, t_3, \ldots, t_n) &= F(t_1, t_2, t_3, \ldots, t_n);\\
f_1(s_n, t_1, t_2, \ldots, t_{n-1}) &= \sum_{0\le t_n<m_n} g_n(s_n, t_n)\, f_0(t_1, t_2, \ldots, t_n);\\
f_2(s_{n-1}, s_n, t_1, \ldots, t_{n-2}) &= \sum_{0\le t_{n-1}<m_{n-1}} g_{n-1}(s_{n-1}, s_n, t_{n-1})\, f_1(s_n, t_1, \ldots, t_{n-1});\\
&\;\;\vdots\\
f_n(s_1, s_2, s_3, \ldots, s_n) &= \sum_{0\le t_1<m_1} g_1(s_1, \ldots, s_n, t_1)\, f_{n-1}(s_2, s_3, \ldots, s_n, t_1);\\
f(s_1, s_2, s_3, \ldots, s_n) &= f_n(s_1, s_2, s_3, \ldots, s_n).
\end{aligned} \eqno(40)$$

For Yates’s method as shown above, $g_j(s_j, \ldots, s_n, t_j) = (-1)^{s_j t_j}$; $f_0(t_1, t_2, t_3)$
represents the “Given”; $f_1(s_3, t_1, t_2)$ represents the “First step”; and so on.
Whenever a desired set of sums can be put into the form of (39), for reasonably
simple functions $g_j(s_j, \ldots, s_n, t_j)$, the scheme (40) will reduce the amount of
computation from order $N^2$ to order $N \log N$ or thereabouts, where $N = m_1 \ldots m_n$ is
the number of data points. Furthermore this scheme is ideally suited to parallel
computation. The important special case of one-dimensional Fourier transforms
is discussed in exercises 14 and 53; we have considered the one-dimensional case
also in Section 4.3.3C.
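
Scheme (40) translates almost line for line into code. The sketch below is an illustration under stated assumptions: the data are kept in a dictionary keyed by index tuples, which is clear rather than fast; `yates_general` and `dft_factors` are hypothetical names; and indices are 0-based, so `g[j]` plays the role of $g_{j+1}$ and receives the tuple $(s_{j+1}, \ldots, s_n)$ together with the summation index. With the factors chosen as in `dft_factors`, the result is the multidimensional transform (37).

```python
import cmath
from itertools import product


def yates_general(F, m, g):
    """Evaluate the sums (39) by the n passes of (40).

    F maps tuples (t_1, ..., t_n) to values, m = (m_1, ..., m_n), and
    g[j](s, t) returns the factor for variable j, where s = (s_j, ..., s_n).
    """
    n = len(m)
    f = dict(F)                                  # f_0
    for j in range(n - 1, -1, -1):               # eliminate the last t first
        new = {}
        for s in product(*(range(m[i]) for i in range(j, n))):
            for t in product(*(range(m[i]) for i in range(j))):
                # Previous stage is indexed by (s after the first, remaining t's).
                new[s + t] = sum(g[j](s, tj) * f[s[1:] + t + (tj,)]
                                 for tj in range(m[j]))
        f = new
    return f                                     # keyed by (s_1, ..., s_n)


def dft_factors(m):
    """Factors exp(2*pi*i*s_j*t_j/m_j) that turn (39) into the transform (37)."""
    return [lambda s, t, mj=mj: cmath.exp(2j * cmath.pi * s[0] * t / mj) for mj in m]


m = (2, 3)
F = {t: complex(3 * t[0] + t[1] + 1) for t in product(range(2), range(3))}
f = yates_general(F, m, dft_factors(m))
print(f[(0, 0)])   # the sum of all six data values
```

Each pass costs about $N m_j$ multiply-add steps, so the total is $N(m_1 + \cdots + m_n)$ rather than $N^2$.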


Let us consider one more special case of polynomial evaluation. Lagrange’s
interpolation polynomial of order $n$, which we shall write as

$$u_{[n]}(x) = y_0\,\frac{(x-x_1)(x-x_2)\ldots(x-x_n)}{(x_0-x_1)(x_0-x_2)\ldots(x_0-x_n)}
+ y_1\,\frac{(x-x_0)(x-x_2)\ldots(x-x_n)}{(x_1-x_0)(x_1-x_2)\ldots(x_1-x_n)}
+ \cdots
+ y_n\,\frac{(x-x_0)(x-x_1)\ldots(x-x_{n-1})}{(x_n-x_0)(x_n-x_1)\ldots(x_n-x_{n-1})}, \eqno(41)$$

is the only polynomial of degree $\le n$ in $x$ that takes on the respective values
$y_0$, $y_1$, ..., $y_n$ at the $n+1$ distinct points $x = x_0$, $x_1$, ..., $x_n$. (For it is evident
from (41) that $u_{[n]}(x_k) = y_k$ for $0 \le k \le n$. If $f(x)$ is any such polynomial
of degree $\le n$, then $g(x) = f(x) - u_{[n]}(x)$ is of degree $\le n$, and $g(x)$ is zero
for $x = x_0$, $x_1$, ..., $x_n$; therefore $g(x)$ must be a multiple of the polynomial
$(x - x_0)(x - x_1)\ldots(x - x_n)$. The degree of the latter polynomial is greater
than $n$, so $g(x) = 0$.) If we assume that the values of a function in some table
are well approximated by a polynomial, formula (41) may therefore be used to
“interpolate” for values of the function at points $x$ not appearing in the table.
Lagrange presented (41) to his class at the Paris École Normale in 1795 [see
his Œuvres 7 (Paris: 1877), 286]; but Edward Waring of Cambridge University
actually deserves the credit, because he had already presented the same formula
quite clearly and explicitly in Philosophical Transactions 69 (1779), 59-67.

There seem to be quite a few additions, subtractions, multiplications, and
divisions in Waring and Lagrange’s formula; in fact, there are exactly $n$ additions,
$2n^2 + 2n$ subtractions, $2n^2 + n - 1$ multiplications, and $n + 1$ divisions. But
fortunately (as we might be conditioned to suspect by now), improvement is
possible.

The basic idea for simplifying (41) is to exploit the fact that

$$u_{[n]}(x) - u_{[n-1]}(x) = 0 \quad\text{for } x = x_0, \ldots, x_{n-1};$$

thus $u_{[n]}(x) - u_{[n-1]}(x)$ is a polynomial of degree $n$ or less, and a multiple of
$(x - x_0)\ldots(x - x_{n-1})$. We conclude that $u_{[n]}(x) = \alpha_n(x - x_0)\ldots(x - x_{n-1}) +
u_{[n-1]}(x)$, where $\alpha_n$ is a constant. This leads us to Newton’s interpolation formula

$$u_{[n]}(x) = \alpha_n(x - x_0)(x - x_1)\ldots(x - x_{n-1}) + \cdots
+ \alpha_2(x - x_0)(x - x_1) + \alpha_1(x - x_0) + \alpha_0, \eqno(42)$$

where the $\alpha$’s are some coefficients that we want to determine from the given
numbers $x_0$, $x_1$, ..., $x_n$, $y_0$, $y_1$, ..., $y_n$. Notice that this formula holds for all $n$;
the coefficient $\alpha_k$ does not depend on $x_{k+1}$, ..., $x_n$, or on $y_{k+1}$, ..., $y_n$. Once
the $\alpha$’s are known, Newton’s interpolation formula is convenient for calculation,
since we may generalize Horner’s rule once again and write

$$u_{[n]}(x) = \bigl((\ldots(\alpha_n(x - x_{n-1}) + \alpha_{n-1})(x - x_{n-2}) + \cdots)(x - x_0) + \alpha_0\bigr). \eqno(43)$$

This requires $n$ multiplications and $2n$ additions. Alternatively, we may evaluate
each of the individual terms of (42) from right to left; with $2n - 1$ multiplications
and $2n$ additions we thereby calculate all of the values $u_{[0]}(x)$, $u_{[1]}(x)$, ..., $u_{[n]}(x)$,
and this indicates whether or not an interpolation process is converging.

The coefficients $\alpha_k$ in Newton’s formula may be found by computing the
divided differences in the following tableau (shown for $n = 3$):

$$\begin{array}{llll}
y_0\\
 & (y_1 - y_0)/(x_1 - x_0) = y_1'\\
y_1 & & (y_2' - y_1')/(x_2 - x_0) = y_2''\\
 & (y_2 - y_1)/(x_2 - x_1) = y_2' & & (y_3'' - y_2'')/(x_3 - x_0) = y_3'''\\
y_2 & & (y_3' - y_2')/(x_3 - x_1) = y_3''\\
 & (y_3 - y_2)/(x_3 - x_2) = y_3'\\
y_3
\end{array} \eqno(44)$$

It is possible to prove that $\alpha_0 = y_0$, $\alpha_1 = y_1'$, $\alpha_2 = y_2''$, etc., and to show
that the divided differences have important relations to the derivatives of the
function being interpolated; see exercise 15. Therefore the following calculation
(corresponding to (44)) may be used to obtain the $\alpha$’s:


Start with $(\alpha_0, \alpha_1, \ldots, \alpha_n) \leftarrow (y_0, y_1, \ldots, y_n)$;
then, for $k = 1$, 2, ..., $n$ (in this order),
set $\alpha_j \leftarrow (\alpha_j - \alpha_{j-1})/(x_j - x_{j-k})$ for $j = n$, $n - 1$, ..., $k$ (in this order).


This process requires $\frac12(n^2 + n)$ divisions and $n^2 + n$ subtractions, so about
three-fourths of the work implied in (41) has been saved.
For example, suppose that we want to estimate 1.5! from the values of
0!, 1!, 2!, and 3!, using a cubic polynomial. The divided differences are

$$\begin{array}{c|ccccc}
x & y & y' & y'' & y'''\\
\hline
0 & 1\\
 & & 0\\
1 & 1 & & \frac12\\
 & & 1 & & \frac13\\
2 & 2 & & \frac32\\
 & & 4\\
3 & 6
\end{array}$$

so $u_{[0]}(x) = u_{[1]}(x) = 1$, $u_{[2]}(x) = \frac12 x(x - 1) + 1$, $u_{[3]}(x) = \frac13 x(x - 1)(x - 2) +
\frac12 x(x - 1) + 1$. Setting $x = 1.5$ in $u_{[3]}(x)$ gives $-.125 + .375 + 1 = 1.25$; presumably
the “correct” value is $\Gamma(2.5) = \frac34\sqrt{\pi} \approx 1.33$. (But there are of course many other
sequences that begin with the numbers 1, 1, 2, and 6.)
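
The divided-difference calculation and the Horner-like evaluation (43) can be sketched as follows, using exact rational arithmetic so that the worked example is reproduced without rounding. The function names are hypothetical; the interpolation points are assumed to be distinct.

```python
from fractions import Fraction


def newton_coefficients(xs, ys):
    """Divided differences computed in place, following the calculation for (44)."""
    a = [Fraction(y) for y in ys]
    n = len(xs) - 1
    for k in range(1, n + 1):
        for j in range(n, k - 1, -1):          # j = n, n-1, ..., k, in this order
            a[j] = (a[j] - a[j - 1]) / (xs[j] - xs[j - k])
    return a


def newton_eval(xs, a, x):
    """Evaluate (42) by the generalized Horner rule (43)."""
    result = a[-1]
    for k in range(len(a) - 2, -1, -1):
        result = result * (x - xs[k]) + a[k]
    return result


xs, ys = [0, 1, 2, 3], [1, 1, 2, 6]            # 0!, 1!, 2!, 3!
alphas = newton_coefficients(xs, ys)
print(alphas)                                  # coefficients 1, 0, 1/2, 1/3
print(newton_eval(xs, alphas, Fraction(3, 2))) # 5/4, i.e. 1.25
```

Running $j$ downward matters: each $\alpha_j$ must be updated from the old value of $\alpha_{j-1}$, which an upward loop would already have overwritten.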

If we want to interpolate several polynomials that have the same interpolation
points $x_0$, $x_1$, ..., $x_n$ but varying values $y_0$, $y_1$, ..., $y_n$, it is desirable to
rewrite (41) in a form suggested by W. J. Taylor [J. Research Nat. Bur. Standards
35 (1945), 151-155]:

$$u_{[n]}(x) = \Bigl(\frac{y_0 w_0}{x - x_0} + \cdots + \frac{y_n w_n}{x - x_n}\Bigr) \Big/ \Bigl(\frac{w_0}{x - x_0} + \cdots + \frac{w_n}{x - x_n}\Bigr), \eqno(45)$$

when $x \notin \{x_0, x_1, \ldots, x_n\}$, where

$$w_k = 1/(x_k - x_0)\ldots(x_k - x_{k-1})(x_k - x_{k+1})\ldots(x_k - x_n). \eqno(46)$$

This form is also recommended for its numerical stability [see P. Henrici, Essentials
of Numerical Analysis (New York: Wiley, 1982), 237-243]. The denominator
of (45) is the partial fraction expansion of $1/(x - x_0)(x - x_1)\ldots(x - x_n)$.
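
Formulas (45) and (46) separate the work that depends only on the interpolation points from the work that depends on the values, which is the point of using them for several sets of $y$’s. A minimal sketch with hypothetical function names, again in exact rational arithmetic:

```python
from fractions import Fraction


def barycentric_weights(xs):
    """The weights w_k of (46); they depend only on the points x_0, ..., x_n."""
    weights = []
    for k, xk in enumerate(xs):
        prod = Fraction(1)
        for j, xj in enumerate(xs):
            if j != k:
                prod *= xk - xj
        weights.append(1 / prod)
    return weights


def barycentric_eval(xs, weights, ys, x):
    """Evaluate (45); x must not coincide with any interpolation point."""
    num = sum(y * w / (x - xk) for xk, w, y in zip(xs, weights, ys))
    den = sum(w / (x - xk) for xk, w in zip(xs, weights))
    return num / den


xs = [0, 1, 2, 3]
ws = barycentric_weights(xs)                    # computed once
x = Fraction(3, 2)
print(barycentric_eval(xs, ws, [1, 1, 2, 6], x))   # the factorial data again
print(barycentric_eval(xs, ws, [0, 1, 4, 9], x))   # data from x^2
```

Once the weights are known, each additional set of values costs only order $n$ operations per evaluation point.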

An important and somewhat surprising application of polynomial interpo- 
lation was discovered by Adi Shamir [CACM 22 (1979), 612-613], who observed 
that polynomials mod p can be used to “share a secret.” This means that we can 
design a system of secret keys or passwords such that the knowledge of any n+1 
of the keys enables efficient calculation of a magic number N that unlocks a door 
(say), but the knowledge of any n of the keys gives no information whatsoever 
about N. Shamir’s amazingly simple solution to this problem is to choose a
random polynomial $u(x) = u_n x^n + \cdots + u_1 x + u_0$, where $0 \le u_j < p$ and $p$ is a large
prime number. Each part of the secret is an integer $x$ in the range $0 < x < p$,
together with the value of $u(x) \bmod p$; and the supersecret number N is the
constant term $u_0$. Given $n + 1$ values $u(x_j)$, we can deduce N by interpolation.
But if only $n$ values of $u(x_j)$ are given, there is a unique polynomial $u(x)$ having
a given constant term but the same values at $x_1$, ..., $x_n$; thus the $n$ values do
not make one particular N more likely than any other.
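
A toy version of this scheme might look as follows. It is only a sketch under stated assumptions: the function names are hypothetical, the prime $2^{61} - 1$ is an arbitrary illustrative choice, the keys use the points $x = 1, 2, \ldots$ for simplicity, and the random coefficients come from Python's `secrets.randbelow`. The secret is recovered by evaluating Lagrange's formula (41) at $x = 0$, with all arithmetic mod $p$.

```python
import secrets


def make_shares(secret, n, num_shares, p):
    """Choose u(x) of degree n with u_0 = secret; return pairs (x, u(x) mod p)."""
    coeffs = [secret % p] + [secrets.randbelow(p) for _ in range(n)]
    shares = []
    for x in range(1, num_shares + 1):       # distinct points with 0 < x < p
        y = 0
        for c in reversed(coeffs):           # Horner's rule mod p
            y = (y * x + c) % p
        shares.append((x, y))
    return shares


def recover_secret(shares, p):
    """Interpolate the constant term u(0) from n+1 shares, mod p."""
    secret = 0
    for k, (xk, yk) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if j != k:
                num = num * (-xj) % p        # factor (0 - x_j) of the numerator
                den = den * (xk - xj) % p
        secret = (secret + yk * num * pow(den, -1, p)) % p
    return secret


p = 2**61 - 1                                # a prime, chosen for illustration
shares = make_shares(123456789, 2, 5, p)     # degree 2: any 3 of the 5 keys suffice
print(recover_secret(shares[:3], p), recover_secret(shares[2:], p))
```

With fewer than $n + 1$ shares the same routine returns a value that is generally unrelated to the secret, in line with the argument above.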

It is instructive to note that evaluation of the interpolation polynomial
is just a special case of the Chinese remainder algorithm of Section 4.3.2 and
exercise 4.6.2-3, since we know the values of $u_{[n]}(x)$ modulo the relatively prime
polynomials $x - x_0$, ..., $x - x_n$. (As we have seen in Section 4.6.2 and in the
discussion following (3), $f(x) \bmod (x - x_0) = f(x_0)$.) Under this interpretation,
Newton’s formula (42) is precisely the “mixed-radix representation” of Eq. 4.3.2-(25);
and 4.3.2-(24) yields another way to compute $\alpha_0$, ..., $\alpha_n$ using the same
number of operations as (44).


By applying fast Fourier transforms, it is possible to reduce the running time
for interpolation to $O(n (\log n)^2)$, and a similar reduction can also be made for
related algorithms such as the solution to the Chinese remainder problem and the 
evaluation of an nth degree polynomial at n different points. [See E. Horowitz, 
Inf. Proc. Letters 1 (1972), 157-163; A. Borodin and R. Moenck, J. Comp. 
Syst. Sci. 8 (1974), 336-385; A. Borodin, Complexity of Sequential and Parallel 
Numerical Algorithms, edited by J. F. Traub (New York: Academic Press, 1973), 
149-180; D. Bini and V. Pan, Polynomial and Matrix Computations 1 (Boston:
Birkhäuser, 1994), Chapter 1.] However, these observations are primarily of
theoretical interest, since the known algorithms have a rather large overhead 
factor that makes them unattractive unless n is quite large. 


A remarkable extension of the method of divided differences, which applies 
to quotients of polynomials as well as to polynomials, was introduced by T. N. 
Thiele in 1909. Thiele’s method of “reciprocal differences” is discussed in L. M. 
Milne-Thompson’s Calculus of Finite Differences (London: MacMillan, 1933), 
Chapter 5; see also R. W. Floyd, CACM 3 (1960), 508. 

*Bilinear forms. Several of the problems we have considered in this section are
special cases of the general problem of evaluating a set of bilinear forms

$$z_k = \sum_{i=1}^{m}\sum_{j=1}^{n} t_{ijk}\, x_i\, y_j, \qquad\text{for } 1 \le k \le s, \eqno(47)$$

where the $t_{ijk}$ are specific coefficients belonging to some given field. The
three-dimensional array $(t_{ijk})$ is called an $m \times n \times s$ tensor, and we can display it by
writing down $s$ matrices of size $m \times n$, one for each value of $k$. For example, the
problem of multiplying complex numbers, namely the problem of evaluating

$$z_1 + iz_2 = (x_1 + ix_2)(y_1 + iy_2) = (x_1 y_1 - x_2 y_2) + i(x_1 y_2 + x_2 y_1), \eqno(48)$$

is the problem of computing the bilinear form specified by the $2 \times 2 \times 2$ tensor

$$\begin{pmatrix} 1 & 0\\ 0 & -1\end{pmatrix}\begin{pmatrix} 0 & 1\\ 1 & 0\end{pmatrix}.$$

Matrix multiplication as defined in (34) is the problem of evaluating a set of
bilinear forms corresponding to a particular $mn \times ns \times ms$ tensor. Fourier
transforms (37) can also be cast in this mold, although they are linear instead
of bilinear, if we let the $x$’s be constant rather than variable.

The evaluation of bilinear forms is most easily studied if we restrict ourselves
to what might be called normal evaluation schemes, in which all chain
multiplications take place between a linear combination of the $x$’s and a linear
combination of the $y$’s. Thus, we form $r$ products

$$w_l = (a_{1l}x_1 + \cdots + a_{ml}x_m)(b_{1l}y_1 + \cdots + b_{nl}y_n), \qquad\text{for } 1 \le l \le r, \eqno(49)$$

and obtain the $z$’s as linear combinations of these products,

$$z_k = c_{k1}w_1 + \cdots + c_{kr}w_r, \qquad\text{for } 1 \le k \le s. \eqno(50)$$

Here all the $a$’s, $b$’s, and $c$’s belong to a given field of coefficients. By comparing
(50) to (47), we see that a normal evaluation scheme is correct for the tensor
$(t_{ijk})$ if and only if

$$t_{ijk} = a_{i1}b_{j1}c_{k1} + \cdots + a_{ir}b_{jr}c_{kr} \eqno(51)$$

for $1 \le i \le m$, $1 \le j \le n$, and $1 \le k \le s$.

A nonzero tensor $(t_{ijk})$ is said to be of rank one if there are three vectors
$(a_1, \ldots, a_m)$, $(b_1, \ldots, b_n)$, $(c_1, \ldots, c_s)$ such that $t_{ijk} = a_i b_j c_k$ for all $i$, $j$, $k$. We
can extend this definition to all tensors by saying that the rank of $(t_{ijk})$ is the
minimum number $r$ such that $(t_{ijk})$ is expressible as the sum of $r$ rank-one
tensors in the given field. Comparing this definition with Eq. (51) shows that
the rank of a tensor is the minimum number of chain multiplications in a normal
evaluation of the corresponding bilinear forms. Incidentally, when $s = 1$ the
tensor $(t_{ijk})$ is just an ordinary matrix, and the rank of $(t_{ij1})$ as a tensor is the
same as its rank as a matrix (see exercise 49). The concept of tensor rank was
introduced by F. L. Hitchcock in J. Math. and Physics 6 (1927), 164-189; its
application to the complexity of polynomial evaluation was pointed out in an
important paper by V. Strassen, Crelle 264 (1973), 184-202.

Winograd’s scheme (35) for matrix multiplication is “abnormal” because it
mixes x’s and y’s before multiplying them. The Strassen–Winograd scheme (36),
on the other hand, does not rely on the commutativity of multiplication, so it is
normal. In fact, (36) corresponds to the following way to represent the 4 × 4 × 4
tensor for 2 × 2 matrix multiplication as a sum of seven rank-one tensors:

$$
\begin{aligned}
\left(\begin{array}{c|c|c|c}
1000&0000&0010&0000\\
0100&0000&0001&0000\\
0000&1000&0000&0010\\
0000&0100&0000&0001
\end{array}\right)
&=
\left(\begin{array}{c|c|c|c}
1000&1000&1000&1000\\
0000&0000&0000&0000\\
0000&0000&0000&0000\\
0000&0000&0000&0000
\end{array}\right)\\
&\quad+
\left(\begin{array}{c|c|c|c}
0000&0000&0000&0000\\
0100&0000&0000&0000\\
0000&0000&0000&0000\\
0000&0000&0000&0000
\end{array}\right)
+
\left(\begin{array}{c|c|c|c}
0000&00\bar{1}1&0000&00\bar{1}1\\
0000&0000&0000&0000\\
0000&001\bar{1}&0000&001\bar{1}\\
0000&0000&0000&0000
\end{array}\right)\\
&\quad+
\left(\begin{array}{c|c|c|c}
0000&0000&0000&0000\\
0000&0000&0000&0000\\
0000&0000&0000&0000\\
0000&\bar{1}11\bar{1}&0000&0000
\end{array}\right)
+
\left(\begin{array}{c|c|c|c}
0000&0000&0000&0000\\
0000&0000&0000&0000\\
0000&0000&\bar{1}010&\bar{1}010\\
0000&0000&\bar{1}010&\bar{1}010
\end{array}\right)\\
&\quad+
\left(\begin{array}{c|c|c|c}
0000&0000&0001&0000\\
0000&0000&0001&0000\\
0000&0000&000\bar{1}&0000\\
0000&0000&000\bar{1}&0000
\end{array}\right)
+
\left(\begin{array}{c|c|c|c}
0000&\bar{1}01\bar{1}&\bar{1}01\bar{1}&\bar{1}01\bar{1}\\
0000&0000&0000&0000\\
0000&10\bar{1}1&10\bar{1}1&10\bar{1}1\\
0000&10\bar{1}1&10\bar{1}1&10\bar{1}1
\end{array}\right).
\end{aligned}
\tag{52}
$$

(Here $\bar{1}$ stands for $-1$.)

The fact that (51) is symmetric in i, j, k and invariant under a variety
of transformations makes the study of tensor rank mathematically tractable,
and it also leads to some surprising consequences about bilinear forms. We
can permute the indices i, j, k to obtain “transposed” bilinear forms, and the
transposed tensor clearly has the same rank; but the corresponding bilinear forms
are conceptually quite different. For example, a normal scheme for evaluating an
(m × n) times (n × s) matrix product implies the existence of a normal scheme to
evaluate an (n × s) times (s × m) matrix product, using the same number of chain
multiplications. In matrix terms these two problems hardly seem to be related
at all (they involve different numbers of dot products, on vectors of different
sizes), but in tensor terms they are equivalent. [See V. Y. Pan, Uspekhi Mat.
Nauk 27, 5 (September–October 1972), 249–250; J. E. Hopcroft and J. Musinski,
SICOMP 2 (1973), 159–173.]

When the tensor $(t_{ijk})$ can be represented as a sum (51) of r rank-one
tensors, let A, B, C be the matrices $(a_{il})$, $(b_{jl})$, $(c_{kl})$ of respective sizes m × r,
n × r, s × r; we shall say that (A, B, C) is a realization of the tensor $(t_{ijk})$. For
example, the realization of 2 × 2 matrix multiplication in (52) can be specified
by the matrices

$$
A = \begin{pmatrix}
1&0&\bar{1}&0&0&1&\bar{1}\\
0&1&0&0&0&1&0\\
0&0&1&0&1&\bar{1}&1\\
0&0&0&1&1&\bar{1}&1
\end{pmatrix},\quad
B = \begin{pmatrix}
1&0&0&\bar{1}&\bar{1}&0&1\\
0&1&0&1&0&0&0\\
0&0&1&1&1&0&\bar{1}\\
0&0&\bar{1}&\bar{1}&0&1&1
\end{pmatrix},\quad
C = \begin{pmatrix}
1&1&0&0&0&0&0\\
1&0&1&1&0&0&1\\
1&0&0&0&1&1&1\\
1&0&1&0&1&0&1
\end{pmatrix}.
\tag{53}
$$
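
The seven-term realization of 2 × 2 matrix multiplication can be checked mechanically. The following is a minimal sketch, assuming the variables are ordered as (x11, x12, x21, x22), (y11, y21, y12, y22), and (z11, z21, z12, z22); the helper names `tensor_from_realization` and `matmul_tensor` are illustrative, not notation from the text.

```python
# Candidate realization (A, B, C): column l describes the l-th chain multiplication.
A = [[1, 0, -1, 0, 0, 1, -1],
     [0, 1, 0, 0, 0, 1, 0],
     [0, 0, 1, 0, 1, -1, 1],
     [0, 0, 0, 1, 1, -1, 1]]
B = [[1, 0, 0, -1, -1, 0, 1],
     [0, 1, 0, 1, 0, 0, 0],
     [0, 0, 1, 1, 1, 0, -1],
     [0, 0, -1, -1, 0, 1, 1]]
C = [[1, 1, 0, 0, 0, 0, 0],
     [1, 0, 1, 1, 0, 0, 1],
     [1, 0, 0, 0, 1, 1, 1],
     [1, 0, 1, 0, 1, 0, 1]]

def tensor_from_realization(A, B, C):
    """Return t[i][j][k] = sum over l of A[i][l] * B[j][l] * C[k][l]."""
    r = len(A[0])
    return [[[sum(A[i][l] * B[j][l] * C[k][l] for l in range(r))
              for k in range(len(C))]
             for j in range(len(B))]
            for i in range(len(A))]

def matmul_tensor():
    """Tensor of 2 x 2 matrix multiplication under the assumed variable ordering."""
    xs = [(0, 0), (0, 1), (1, 0), (1, 1)]   # (row, col) of each x variable
    ys = [(0, 0), (1, 0), (0, 1), (1, 1)]   # (row, col) of each y variable
    zs = [(0, 0), (1, 0), (0, 1), (1, 1)]   # (row, col) of each z variable
    return [[[1 if (xr == zr and xc == yr and yc == zc) else 0
              for (zr, zc) in zs]
             for (yr, yc) in ys]
            for (xr, xc) in xs]

print(tensor_from_realization(A, B, C) == matmul_tensor())
```

Each column of A and B gives the two linear forms that are multiplied, and the corresponding column of C says which outputs receive that product.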
An m × n × s tensor $(t_{ijk})$ can also be represented as a matrix by grouping
its subscripts together. We shall write $(t_{(ij)k})$ for the mn × s matrix whose rows
are indexed by the pair of subscripts $\langle i,j\rangle$ and whose columns are indexed by k.
Similarly, $(t_{k(ij)})$ stands for the s × mn matrix that contains $t_{ijk}$ in row k and
column $\langle i,j\rangle$; $(t_{(ik)j})$ is an ms × n matrix, and so on. The indices of an array
need not be integers, and we are using ordered pairs as indices here. We can use
this notation to derive the following simple but useful lower bound on the rank
of a tensor.


Lemma T. Let (A, B, C) be a realization of an m × n × s tensor $(t_{ijk})$. Then
$\mathrm{rank}(A) \ge \mathrm{rank}(t_{i(jk)})$, $\mathrm{rank}(B) \ge \mathrm{rank}(t_{j(ik)})$, and $\mathrm{rank}(C) \ge \mathrm{rank}(t_{k(ij)})$;
consequently

$$\mathrm{rank}(t_{ijk}) \ge \max\bigl(\mathrm{rank}(t_{i(jk)}),\ \mathrm{rank}(t_{j(ik)}),\ \mathrm{rank}(t_{k(ij)})\bigr).$$


Proof. It suffices by symmetry to show that $r \ge \mathrm{rank}(A) \ge \mathrm{rank}(t_{i(jk)})$. Since
A is an m × r matrix, it is obvious that A cannot have rank greater than r.
Furthermore, according to (51), the matrix $(t_{i(jk)})$ is equal to AQ, where Q is
the r × ns matrix defined by $Q_{l\langle j,k\rangle} = b_{jl}c_{kl}$. If x is any row vector such that
xA = 0 then xAQ = 0, hence all linear dependencies in A occur also in AQ. It
follows that $\mathrm{rank}(AQ) \le \mathrm{rank}(A)$. ∎


As an example of the use of Lemma T, let us consider the problem of 
polynomial multiplication. Suppose we want to multiply a general polynomial 
of degree 2 by a general polynomial of degree 3, obtaining the coefficients of the 
product: 


$$(x_0 + x_1u + x_2u^2)(y_0 + y_1u + y_2u^2 + y_3u^3)
= z_0 + z_1u + z_2u^2 + z_3u^3 + z_4u^4 + z_5u^5. \tag{54}$$


This is the problem of evaluating six bilinear forms corresponding to the 3 × 4 × 6
tensor

$$
\begin{pmatrix}1000\\0000\\0000\end{pmatrix}
\begin{pmatrix}0100\\1000\\0000\end{pmatrix}
\begin{pmatrix}0010\\0100\\1000\end{pmatrix}
\begin{pmatrix}0001\\0010\\0100\end{pmatrix}
\begin{pmatrix}0000\\0001\\0010\end{pmatrix}
\begin{pmatrix}0000\\0000\\0001\end{pmatrix}.
\tag{55}
$$

For brevity, we may write (54) as x(u)y(u) = z(u), letting x(u) denote the
polynomial $x_0 + x_1u + x_2u^2$, etc. (We have come full circle from the way we
began this section, since Eq. (1) refers to u(x), not x(u); the notation has changed
because the coefficients of the polynomials are now the variables of interest to us.)

If each of the six matrices in (55) is regarded as a vector of length 12 indexed
by $\langle i,j\rangle$, it is clear that the vectors are linearly independent, since they are
nonzero in different positions; hence the rank of (55) is at least 6 by Lemma T.
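
The lower bound of Lemma T is easy to compute for small tensors. The sketch below, which uses exact rational elimination and hypothetical helper names (`rank`, `polymul_tensor`, `flattening_ranks`), builds the 3 × 4 × 6 tensor for multiplying a quadratic by a cubic and reports the ranks of its three matrix flattenings.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (list of rows) by Gauss-Jordan elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    rk = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        pivot = next((r for r in range(rk, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(len(m)):
            if r != rk and m[r][col] != 0:
                f = m[r][col] / m[rk][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk

def polymul_tensor(deg_x, deg_y):
    """t[i][j][k] = 1 exactly when x_i * y_j contributes to z_k, that is, i + j == k."""
    return [[[1 if i + j == k else 0 for k in range(deg_x + deg_y + 1)]
             for j in range(deg_y + 1)]
            for i in range(deg_x + 1)]

def flattening_ranks(t):
    """Ranks of the three flattenings that group two of the subscripts together."""
    I, J, K = len(t), len(t[0]), len(t[0][0])
    t_i_jk = [[t[i][j][k] for j in range(J) for k in range(K)] for i in range(I)]
    t_j_ik = [[t[i][j][k] for i in range(I) for k in range(K)] for j in range(J)]
    t_k_ij = [[t[i][j][k] for i in range(I) for j in range(J)] for k in range(K)]
    return rank(t_i_jk), rank(t_j_ik), rank(t_k_ij)

print(flattening_ranks(polymul_tensor(2, 3)))  # (3, 4, 6)
```

The largest of the three values, 6, is the lower bound on the tensor rank used in the argument above.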


Conversely, it is possible to obtain the coefficients $z_0$, $z_1$, ..., $z_5$ by making only
six chain multiplications, for example by computing

$$x(0)y(0),\quad x(1)y(1),\quad \ldots,\quad x(5)y(5); \tag{56}$$

this gives the values of z(0), z(1), ..., z(5), and the formulas developed above
for interpolation will yield the coefficients of z(u). The evaluation of x(j)
and y(j) can be carried out entirely in terms of additions and/or parameter
multiplications, and the interpolation formula merely takes linear combinations
of these values. Thus, all of the chain multiplications are shown in (56), and
the rank of (55) is 6. (We used essentially this same technique when multiplying
high-precision numbers in Algorithm 4.3.3T.)
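
As an illustration of the evaluate-and-interpolate idea, here is a minimal sketch that multiplies two polynomials using one product per sample point and then recovers the coefficients by Lagrange interpolation. The function names are hypothetical, the sample points 0, 1, 2, ... are assumed as in (56), and exact rational arithmetic is used for the interpolation constants.

```python
from fractions import Fraction

def poly_eval(p, u):
    """Horner evaluation; p lists coefficients from lowest to highest degree."""
    acc = 0
    for c in reversed(p):
        acc = acc * u + c
    return acc

def poly_mul(p, q):
    """Schoolbook product, used here only to build interpolation constants and to check."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def lagrange_basis(points, k):
    """Coefficients of the polynomial that is 1 at points[k] and 0 at the other points."""
    coeffs = [Fraction(1)]
    for j, pj in enumerate(points):
        if j != k:
            coeffs = poly_mul(coeffs, [Fraction(-pj), Fraction(1)])
            coeffs = [c / (points[k] - pj) for c in coeffs]
    return coeffs

def multiply_by_interpolation(x, y):
    n = len(x) + len(y) - 1
    points = list(range(n))
    # The only multiplications that involve both x and y: one per sample point.
    products = [poly_eval(x, u) * poly_eval(y, u) for u in points]
    z = [Fraction(0)] * n
    for k, value in enumerate(products):
        for d, c in enumerate(lagrange_basis(points, k)):
            z[d] += c * value
    return z

x, y = [1, 2, 3], [4, 5, 6, 7]
print(multiply_by_interpolation(x, y) == poly_mul(x, y))
```

The constants produced by `lagrange_basis` are fixed in advance, so multiplying by them counts as parameter multiplication rather than chain multiplication.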

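The realization (A, B, C) of (55) sketched in the paragraph above turns out
to be

$$
\begin{pmatrix}
1&1&1&1&1&1\\
0&1&2&3&4&5\\
0&1&4&9&16&25
\end{pmatrix},\quad
\begin{pmatrix}
1&1&1&1&1&1\\
0&1&2&3&4&5\\
0&1&4&9&16&25\\
0&1&8&27&64&125
\end{pmatrix},\quad
\begin{pmatrix}
120&0&0&0&0&0\\
-274&600&-600&400&-150&24\\
225&-770&1070&-780&305&-50\\
-85&355&-590&490&-205&35\\
15&-70&130&-120&55&-10\\
-1&5&-10&10&-5&1
\end{pmatrix}\times\frac{1}{120}.
\tag{57}
$$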


Thus, the scheme does indeed achieve the minimum number of chain
multiplications, but it is completely impractical because it involves so many additions
and parameter multiplications. We shall now study a practical approach to the
generation of more efficient schemes, introduced by S. Winograd.

In the first place, to evaluate the coefficients of x(u)y(u) when deg(x) = m
and deg(y) = n, we can use the identity

$$x(u)y(u) = \bigl(x(u)y(u) \bmod p(u)\bigr) + x_my_n\,p(u), \tag{58}$$

when p(u) is any monic polynomial of degree m + n. The polynomial p(u) should
be chosen so that the coefficients of x(u)y(u) mod p(u) are easy to evaluate.

In the second place, to evaluate the coefficients of x(u)y(u) mod p(u), when
the polynomial p(u) can be factored into q(u)r(u) where gcd(q(u), r(u)) = 1, we
can use the identity

$$x(u)y(u) \bmod q(u)r(u) = \bigl(a(u)r(u)\bigl(x(u)y(u) \bmod q(u)\bigr)
+ b(u)q(u)\bigl(x(u)y(u) \bmod r(u)\bigr)\bigr) \bmod q(u)r(u) \tag{59}$$

where a(u)r(u) + b(u)q(u) = 1; this is essentially the Chinese remainder theorem
applied to polynomials.

In the third place, we can always evaluate the coefficients of the polynomial
x(u)y(u) mod p(u) by using the trivial identity

$$x(u)y(u) \bmod p(u) = \bigl(x(u) \bmod p(u)\bigr)\bigl(y(u) \bmod p(u)\bigr) \bmod p(u). \tag{60}$$


Repeated application of (58), (59), and (60) tends to produce efficient schemes, 
as we shall see. 

For our example problem (54), let us choose $p(u) = u^5 - u$ and apply (58); the
reason for this choice of p(u) will appear as we proceed. Writing $p(u) = u(u^4 - 1)$,
rule (59) reduces to

$$x(u)y(u) \bmod u(u^4 - 1) = \bigl(-(u^4 - 1)x_0y_0 + u^4\bigl(x(u)y(u) \bmod (u^4 - 1)\bigr)\bigr)
\bmod (u^5 - u). \tag{61}$$

Here we have used the fact that $x(u)y(u) \bmod u = x_0y_0$; in general it is a good
idea to choose p(u) in such a way that p(0) = 0, so that this simplification can be
used. If we could now determine the coefficients $w_0$, $w_1$, $w_2$, $w_3$ of the polynomial
$x(u)y(u) \bmod (u^4 - 1) = w_0 + w_1u + w_2u^2 + w_3u^3$, our problem would be solved,
since

$$u^4\bigl(x(u)y(u) \bmod (u^4 - 1)\bigr) \bmod (u^5 - u) = w_0u^4 + w_1u + w_2u^2 + w_3u^3,$$

and the combination of (58) and (61) would reduce to

$$x(u)y(u) = x_0y_0 + (w_1 - x_2y_3)u + w_2u^2 + w_3u^3 + (w_0 - x_0y_0)u^4 + x_2y_3u^5. \tag{62}$$

(This formula can, of course, be verified directly.)
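
One way to carry out such a direct verification is sketched below. The function names are hypothetical; `cyclic_convolution` simply applies the definition of multiplication modulo $u^4 - 1$, and `product_via_62` assembles the six product coefficients from $w_0$, ..., $w_3$ together with the two extra products $x_0y_0$ and $x_2y_3$.

```python
def cyclic_convolution(x, y, n):
    """Coefficients of x(u) * y(u) mod (u^n - 1), computed from the definition."""
    w = [0] * n
    for i, a in enumerate(x):
        for j, b in enumerate(y):
            w[(i + j) % n] += a * b
    return w

def product_via_62(x, y):
    """Coefficients z0..z5 of (x0 + x1 u + x2 u^2)(y0 + y1 u + y2 u^2 + y3 u^3)."""
    w0, w1, w2, w3 = cyclic_convolution(x, y, 4)
    x0y0 = x[0] * y[0]
    x2y3 = x[2] * y[3]
    return [x0y0, w1 - x2y3, w2, w3, w0 - x0y0, x2y3]

def schoolbook(x, y):
    """Ordinary product, for comparison."""
    z = [0] * (len(x) + len(y) - 1)
    for i, a in enumerate(x):
        for j, b in enumerate(y):
            z[i + j] += a * b
    return z

x, y = [1, 2, 3], [4, 5, 6, 7]
print(product_via_62(x, y) == schoolbook(x, y))
```

The wraparound in the cyclic convolution affects only the coefficients of $u^0$ and $u^1$, which is why exactly two correction products are needed.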

The problem remaining to be solved is to compute $x(u)y(u) \bmod (u^4 - 1)$;
and this subproblem is interesting in itself. Let us momentarily allow x(u) to be
of degree 3 instead of degree 2. Then the coefficients of $x(u)y(u) \bmod (u^4 - 1)$
are respectively

$$
\begin{gathered}
x_0y_0 + x_1y_3 + x_2y_2 + x_3y_1,\qquad x_0y_1 + x_1y_0 + x_2y_3 + x_3y_2,\\
x_0y_2 + x_1y_1 + x_2y_0 + x_3y_3,\qquad x_0y_3 + x_1y_2 + x_2y_1 + x_3y_0,
\end{gathered}
$$

and the corresponding tensor is

$$
\begin{pmatrix}1000\\0001\\0010\\0100\end{pmatrix}
\begin{pmatrix}0100\\1000\\0001\\0010\end{pmatrix}
\begin{pmatrix}0010\\0100\\1000\\0001\end{pmatrix}
\begin{pmatrix}0001\\0010\\0100\\1000\end{pmatrix}.
\tag{63}
$$

In general when deg(x) = deg(y) = n − 1, the coefficients of $x(u)y(u) \bmod (u^n - 1)$
are called the cyclic convolution of $(x_0, x_1, \ldots, x_{n-1})$ and $(y_0, y_1, \ldots, y_{n-1})$. The
kth coefficient $w_k$ is the bilinear form $\sum x_iy_j$ summed over all i and j with
i + j ≡ k (modulo n).

The cyclic convolution of degree 4 can be obtained by applying rule (59).
The first step is to find the factors of $u^4 - 1$, namely $(u - 1)(u + 1)(u^2 + 1)$. We
could write this as $(u^2 - 1)(u^2 + 1)$, then apply rule (59), then use (59) again on
the part modulo $(u^2 - 1) = (u - 1)(u + 1)$; but it is easier to generalize the Chinese
remainder rule (59) directly to the case of several relatively prime factors. For
example, we have

$$
\begin{aligned}
x(u)y(u) \bmod q_1(u)q_2(u)q_3(u)
= \bigl(&a_1(u)q_2(u)q_3(u)\bigl(x(u)y(u) \bmod q_1(u)\bigr)
+ a_2(u)q_1(u)q_3(u)\bigl(x(u)y(u) \bmod q_2(u)\bigr)\\
&+ a_3(u)q_1(u)q_2(u)\bigl(x(u)y(u) \bmod q_3(u)\bigr)\bigr) \bmod q_1(u)q_2(u)q_3(u),
\end{aligned}
\tag{64}
$$

where $a_1(u)q_2(u)q_3(u) + a_2(u)q_1(u)q_3(u) + a_3(u)q_1(u)q_2(u) = 1$. (This equation
can also be understood in another way, by noting that the partial fraction
expansion of $1/q_1(u)q_2(u)q_3(u)$ is $a_1(u)/q_1(u) + a_2(u)/q_2(u) + a_3(u)/q_3(u)$.) From (64)
we obtain

$$
\begin{aligned}
x(u)y(u) \bmod (u^4 - 1) = \bigl(&\tfrac14(u^3 + u^2 + u + 1)\,x(1)y(1) - \tfrac14(u^3 - u^2 + u - 1)\,x(-1)y(-1)\\
&- \tfrac12(u^2 - 1)\bigl(x(u)y(u) \bmod (u^2 + 1)\bigr)\bigr) \bmod (u^4 - 1).
\end{aligned}
\tag{65}
$$


The remaining problem is to evaluate $x(u)y(u) \bmod (u^2 + 1)$, and it is time
to invoke rule (60). First we reduce x(u) and y(u) mod $(u^2 + 1)$, obtaining
$X(u) = (x_0 - x_2) + (x_1 - x_3)u$, $Y(u) = (y_0 - y_2) + (y_1 - y_3)u$. Then (60) tells
us to evaluate $X(u)Y(u) = Z_0 + Z_1u + Z_2u^2$, and to reduce this in turn modulo
$(u^2 + 1)$, obtaining $(Z_0 - Z_2) + Z_1u$. The job of computing X(u)Y(u) is simple;
we can use rule (58) with p(u) = u(u + 1) and we get

$$Z_0 = X_0Y_0,\qquad Z_1 = X_0Y_0 - (X_0 - X_1)(Y_0 - Y_1) + X_1Y_1,\qquad Z_2 = X_1Y_1.$$

(We have thereby rediscovered the trick of Eq. 4.3.3–(2) in a more systematic
way.) Putting everything together yields the following realization (A, B, C) of
degree-4 cyclic convolution:

$$
\begin{pmatrix}
1&1&1&0&1\\
1&\bar{1}&0&1&\bar{1}\\
1&1&\bar{1}&0&\bar{1}\\
1&\bar{1}&0&\bar{1}&1
\end{pmatrix},\quad
\begin{pmatrix}
1&1&1&0&1\\
1&\bar{1}&0&1&\bar{1}\\
1&1&\bar{1}&0&\bar{1}\\
1&\bar{1}&0&\bar{1}&1
\end{pmatrix},\quad
\begin{pmatrix}
1&1&2&\bar{2}&0\\
1&\bar{1}&2&2&\bar{2}\\
1&1&\bar{2}&2&0\\
1&\bar{1}&\bar{2}&\bar{2}&2
\end{pmatrix}\times\frac14.
\tag{66}
$$

Here $\bar{1}$ stands for −1 and $\bar{2}$ stands for −2.
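The tensor for cyclic convolution of degree n satisfies

$$t_{i,j,k} = t_{k,-j,i}, \tag{67}$$

treating the subscripts modulo n, since $t_{ijk} = 1$ if and only if i + j ≡ k
(modulo n). Thus if $(a_{il})$, $(b_{jl})$, $(c_{kl})$ is a realization of the cyclic convolution, so
is $(c_{kl})$, $(b_{-j,l})$, $(a_{il})$; in particular, we can realize (63) by transforming (66) into

$$
\begin{pmatrix}
1&1&2&\bar{2}&0\\
1&\bar{1}&2&2&\bar{2}\\
1&1&\bar{2}&2&0\\
1&\bar{1}&\bar{2}&\bar{2}&2
\end{pmatrix}\times\frac14,\quad
\begin{pmatrix}
1&1&1&0&1\\
1&\bar{1}&0&\bar{1}&1\\
1&1&\bar{1}&0&\bar{1}\\
1&\bar{1}&0&1&\bar{1}
\end{pmatrix},\quad
\begin{pmatrix}
1&1&1&0&1\\
1&\bar{1}&0&1&\bar{1}\\
1&1&\bar{1}&0&\bar{1}\\
1&\bar{1}&0&\bar{1}&1
\end{pmatrix}.
\tag{68}
$$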


Now all of the complicated scalars appear in the A matrix. This is important
in practice, since we often want to compute the convolution for many values of
$y_0$, $y_1$, $y_2$, $y_3$ but for a fixed choice of $x_0$, $x_1$, $x_2$, $x_3$. In such a situation, the
arithmetic on x’s can be done once and for all, and we need not count it. Thus
(68) leads to the following scheme for evaluating the cyclic convolution $w_0$, $w_1$,
$w_2$, $w_3$ when $x_0$, $x_1$, $x_2$, $x_3$ are known in advance:

$$
\begin{gathered}
s_1 = y_0 + y_2,\quad s_2 = y_1 + y_3,\quad s_3 = s_1 + s_2,\quad s_4 = s_1 - s_2,\\
s_5 = y_0 - y_2,\quad s_6 = y_3 - y_1,\quad s_7 = s_5 - s_6;\\
m_1 = \tfrac14(x_0 + x_1 + x_2 + x_3)\cdot s_3,\quad m_2 = \tfrac14(x_0 - x_1 + x_2 - x_3)\cdot s_4,\\
m_3 = \tfrac12(x_0 + x_1 - x_2 - x_3)\cdot s_5,\quad m_4 = \tfrac12(-x_0 + x_1 + x_2 - x_3)\cdot s_6,\quad m_5 = \tfrac12(x_3 - x_1)\cdot s_7;\\
t_1 = m_1 + m_2,\quad t_2 = m_3 + m_5,\quad t_3 = m_1 - m_2,\quad t_4 = m_4 - m_5;\\
w_0 = t_1 + t_2,\quad w_1 = t_3 + t_4,\quad w_2 = t_1 - t_2,\quad w_3 = t_3 - t_4.
\end{gathered}
\tag{69}
$$

There are 5 multiplications and 15 additions, while the definition of cyclic
convolution involves 16 multiplications and 12 additions. We will prove later
that 5 multiplications are necessary.
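
The scheme for degree-4 cyclic convolution with fixed x’s can be transcribed almost line for line. The following sketch uses hypothetical function names; `precompute_x` does the arithmetic on the x’s once, and `cyclic4` then uses five multiplications and fifteen additions or subtractions per vector of y’s. Exact fractions are assumed so that the factors 1/4 and 1/2 introduce no rounding.

```python
from fractions import Fraction

def precompute_x(x):
    """Constants depending only on x0..x3; computed once and not counted."""
    x0, x1, x2, x3 = (Fraction(v) for v in x)
    return ((x0 + x1 + x2 + x3) / 4,
            (x0 - x1 + x2 - x3) / 4,
            (x0 + x1 - x2 - x3) / 2,
            (-x0 + x1 + x2 - x3) / 2,
            (x3 - x1) / 2)

def cyclic4(consts, y):
    """Cyclic convolution w0..w3 of the precomputed x with y = (y0, y1, y2, y3)."""
    c1, c2, c3, c4, c5 = consts
    y0, y1, y2, y3 = y
    s1 = y0 + y2; s2 = y1 + y3; s3 = s1 + s2; s4 = s1 - s2
    s5 = y0 - y2; s6 = y3 - y1; s7 = s5 - s6
    m1 = c1 * s3; m2 = c2 * s4; m3 = c3 * s5; m4 = c4 * s6; m5 = c5 * s7
    t1 = m1 + m2; t2 = m3 + m5; t3 = m1 - m2; t4 = m4 - m5
    return [t1 + t2, t3 + t4, t1 - t2, t3 - t4]

def cyclic_by_definition(x, y):
    """Reference implementation: 16 multiplications and 12 additions."""
    w = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            w[(i + j) % 4] += x[i] * y[j]
    return w

x, y = [3, 1, 4, 1], [5, 9, 2, 6]
print(cyclic4(precompute_x(x), y) == cyclic_by_definition(x, y))
```

When many y vectors are convolved with the same x, only `cyclic4` is repeated, which is the situation the operation counts above refer to.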
Going back to our original multiplication problem (54), using (62), we have
derived the realization

$$
\begin{pmatrix}
4&0&1&1&2&\bar{2}&0\\
0&0&1&\bar{1}&2&2&\bar{2}\\
0&4&1&1&\bar{2}&2&0
\end{pmatrix}\times\frac14,\quad
\begin{pmatrix}
1&0&1&1&1&0&1\\
0&0&1&\bar{1}&0&\bar{1}&1\\
0&0&1&1&\bar{1}&0&\bar{1}\\
0&1&1&\bar{1}&0&1&\bar{1}
\end{pmatrix},\quad
\begin{pmatrix}
1&0&0&0&0&0&0\\
0&\bar{1}&1&\bar{1}&0&1&\bar{1}\\
0&0&1&1&\bar{1}&0&\bar{1}\\
0&0&1&\bar{1}&0&\bar{1}&1\\
\bar{1}&0&1&1&1&0&1\\
0&1&0&0&0&0&0
\end{pmatrix}.
\tag{70}
$$

This scheme uses one more than the minimum number of chain multiplications,
but it requires far fewer parameter multiplications than (57). Of course, it
must be admitted that the scheme is still rather complicated: If our goal is
simply to compute the coefficients $z_0$, $z_1$, ..., $z_5$ of the product of two given
polynomials $(x_0 + x_1u + x_2u^2)(y_0 + y_1u + y_2u^2 + y_3u^3)$, as a one-shot problem,
our best bet may well be to use the obvious method that does 12 multiplications
and 6 additions, unless (say) the x’s and y’s are matrices. Another reasonably
attractive scheme, which requires 8 multiplications and 18 additions, appears in
exercise 58(b). Notice that if the x’s are fixed as the y’s vary, (70) does the
evaluation with 7 multiplications and 17 additions. Even though this scheme
isn’t especially useful as it stands, our derivation has illustrated important
techniques that are useful in a variety of other situations. For example, Winograd
has used this approach to compute Fourier transforms using significantly fewer
multiplications than the fast Fourier transform algorithm needs (see exercise 53).
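
A realization can be run as an algorithm: form the linear combinations given by the columns of A and B, multiply them pairwise, and combine the products according to C. The sketch below applies a seven-multiplication realization of the quadratic-by-cubic product in this way. The matrices are written out under the assumption that the first matrix carries a common factor 1/4, and the names `A4`, `apply_realization`, and `scale` are illustrative.

```python
from fractions import Fraction

# A4 holds four times the A matrix, so that all entries stay integral.
A4 = [[4, 0, 1, 1, 2, -2, 0],
      [0, 0, 1, -1, 2, 2, -2],
      [0, 4, 1, 1, -2, 2, 0]]
B = [[1, 0, 1, 1, 1, 0, 1],
     [0, 0, 1, -1, 0, -1, 1],
     [0, 0, 1, 1, -1, 0, -1],
     [0, 1, 1, -1, 0, 1, -1]]
C = [[1, 0, 0, 0, 0, 0, 0],
     [0, -1, 1, -1, 0, 1, -1],
     [0, 0, 1, 1, -1, 0, -1],
     [0, 0, 1, -1, 0, -1, 1],
     [-1, 0, 1, 1, 1, 0, 1],
     [0, 1, 0, 0, 0, 0, 0]]

def apply_realization(A, B, C, x, y, scale=1):
    """Evaluate the bilinear forms: z = C * ((A^T x) .* (B^T y)) / scale."""
    r = len(A[0])
    left = [sum(A[i][l] * x[i] for i in range(len(x))) for l in range(r)]
    right = [sum(B[j][l] * y[j] for j in range(len(y))) for l in range(r)]
    products = [left[l] * right[l] for l in range(r)]   # the r chain multiplications
    return [Fraction(sum(C[k][l] * products[l] for l in range(r)), scale)
            for k in range(len(C))]

print(apply_realization(A4, B, C, [1, 2, 3], [4, 5, 6, 7], scale=4))
```

For integer inputs the result should agree with the ordinary product coefficients; the division by `scale` undoes the factor of 4 absorbed into `A4`.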

Let us conclude this section by determining the exact rank of the n × n × n
tensor that corresponds to the multiplication of two polynomials modulo a third,

$$z_0 + z_1u + \cdots + z_{n-1}u^{n-1}
= (x_0 + x_1u + \cdots + x_{n-1}u^{n-1})(y_0 + y_1u + \cdots + y_{n-1}u^{n-1}) \bmod p(u). \tag{71}$$

Here p(u) stands for any given monic polynomial of degree n; in particular, p(u)
might be $u^n - 1$, so one of the results of our investigation will be to deduce the
rank of the tensor corresponding to cyclic convolution of degree n. It will be
convenient to write p(u) in the form

$$p(u) = u^n - p_{n-1}u^{n-1} - \cdots - p_1u - p_0, \tag{72}$$

so that $u^n \equiv p_0 + p_1u + \cdots + p_{n-1}u^{n-1}$ (modulo p(u)).

The tensor element $t_{ijk}$ is the coefficient of $u^k$ in $u^{i+j} \bmod p(u)$; and this is
the element in row i, column k of the matrix $P^j$, where

$$
P = \begin{pmatrix}
0&1&0&\ldots&0\\
0&0&1&\ldots&0\\
\vdots&\vdots&\vdots&&\vdots\\
0&0&0&\ldots&1\\
p_0&p_1&p_2&\ldots&p_{n-1}
\end{pmatrix}
\tag{73}
$$

is called the companion matrix of p(u). (The indices i, j, k in our discussion will
run from 0 to n − 1 instead of from 1 to n.) It is convenient to transpose the
tensor, for if $T_{ijk} = t_{ikj}$ the individual layers of $(T_{ijk})$ for k = 0, 1, 2, ..., n − 1
are simply given by the matrices

$$I\quad P\quad P^2\quad \ldots\quad P^{n-1}. \tag{74}$$

The first rows of the matrices in (74) are respectively the unit vectors
(1,0,0,...,0), (0,1,0,...,0), (0,0,1,...,0), ..., (0,0,0,...,1), hence a linear
combination $\sum_{k=0}^{n-1} v_kP^k$ will be the zero matrix if and only if the coefficients $v_k$
are all zero. Furthermore, most of these linear combinations are actually
nonsingular matrices, for we have

$$(w_0, w_1, \ldots, w_{n-1}) \sum_{k=0}^{n-1} v_kP^k = (0, 0, \ldots, 0)
\quad\text{if and only if}\quad v(u)w(u) \equiv 0 \pmod{p(u)},$$

where $v(u) = v_0 + v_1u + \cdots + v_{n-1}u^{n-1}$ and $w(u) = w_0 + w_1u + \cdots + w_{n-1}u^{n-1}$.
Thus, $\sum_{k=0}^{n-1} v_kP^k$ is a singular matrix if and only if the polynomial v(u) is a
multiple of some factor of p(u). We are now ready to prove the desired result.
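
The correspondence between the tensor and the powers of the companion matrix can be spot-checked numerically. In the sketch below (hypothetical names throughout) the list `p_low` holds $p_0$, ..., $p_{n-1}$ in the sign convention of (72), and `layers_match` confirms that row i of $P^j$ equals the coefficient vector of $u^{i+j}$ mod p(u) for all i and j below n.

```python
def companion(p_low):
    """Companion matrix of p(u) = u^n - p_{n-1} u^{n-1} - ... - p_0."""
    n = len(p_low)
    rows = [[1 if c == r + 1 else 0 for c in range(n)] for r in range(n - 1)]
    rows.append(list(p_low))
    return rows

def mat_mul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def u_power_mod(e, p_low):
    """Coefficients of u^e mod p(u), lowest degree first."""
    n = len(p_low)
    v = [1] + [0] * (n - 1)
    for _ in range(e):
        top = v[-1]
        v = [0] + v[:-1]                          # multiply by u
        v = [a + top * b for a, b in zip(v, p_low)]  # replace u^n by p_0 + ... + p_{n-1} u^{n-1}
    return v

def layers_match(p_low):
    """True if row i of P^j is the coefficient vector of u^(i+j) mod p(u) for 0 <= i, j < n."""
    n = len(p_low)
    P = companion(p_low)
    power = [[1 if i == j else 0 for j in range(n)] for i in range(n)]  # P^0
    for j in range(n):
        for i in range(n):
            if power[i] != u_power_mod(i + j, p_low):
                return False
        power = mat_mul(power, P)
    return True

print(layers_match([1, 0, 0, 0]))   # p(u) = u^4 - 1, the cyclic case
```

Multiplying a row vector by P on the right corresponds to multiplying the associated polynomial by u and reducing, which is all the check relies on.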


Theorem W (S. Winograd, 1975). Let p(u) be a monic polynomial of degree n
whose complete factorization over a given infinite field is

$$p(u) = p_1(u)^{e_1} \ldots p_q(u)^{e_q}. \tag{75}$$

Then the rank of the tensor (74) corresponding to the bilinear forms (71) is 2n − q
over this field.


Proof. The bilinear forms can be evaluated with only 2n − q chain multiplications
by using rules (58), (59), (60) in an appropriate fashion, so we must prove only
that the rank r is ≥ 2n − q. The discussion above establishes the fact that
$\mathrm{rank}(T_{(ij)k}) = n$; hence by Lemma T, any n × r realization (A, B, C) of $(T_{ijk})$
has rank(C) = n. Our strategy will be to use Lemma T again, by finding a
vector $(v_0, v_1, \ldots, v_{n-1})$ that has the following two properties:

i) The vector $(v_0, v_1, \ldots, v_{n-1})C$ has at most q + r − n nonzero coefficients.
ii) The matrix $v(P) = \sum_{k=0}^{n-1} v_kP^k$ is nonsingular.

This and Lemma T will prove that q + r − n ≥ n, since the identity

$$\sum_{l=1}^{r} a_{il}b_{jl}\Bigl(\sum_{k=0}^{n-1} v_kc_{kl}\Bigr) = v(P)_{ij}$$

shows how to realize the n × n × 1 tensor v(P) of rank n with q + r − n chain
multiplications.

We may assume for convenience that the first n columns of C are linearly
independent. Let D be the n × n matrix such that the first n columns of DC are
equal to the identity matrix. Our goal will be achieved if there is a linear
combination $(v_0, v_1, \ldots, v_{n-1})$ of at most q rows of D, such that v(P) is nonsingular;
such a vector will satisfy conditions (i) and (ii).

Since the rows of D are linearly independent, no irreducible factor $p_\lambda(u)$ can
divide the polynomials corresponding to every row. Given a vector

$$w = (w_0, w_1, \ldots, w_{n-1}),$$

let covered(w) be the set of all λ such that w(u) is not a multiple of $p_\lambda(u)$. From
two vectors v and w we can find a linear combination v + αw such that

$$\mathrm{covered}(v + \alpha w) = \mathrm{covered}(v) \cup \mathrm{covered}(w), \tag{76}$$

for some α in the field. The reason is that if λ is covered by v or w but not both,
then λ is covered by v + αw for all nonzero α; if λ is covered by both v and w
but λ is not covered by v + αw, then λ is covered by v + βw for all β ≠ α. By
trying q + 1 different values of α, at least one must yield (76). In this way we can
systematically construct a linear combination of at most q rows of D, covering
all λ for 1 ≤ λ ≤ q. ∎


One of the most important corollaries of Theorem W is that the rank of a
tensor can depend on the field from which we draw the elements of the realization
(A, B, C). For example, consider the tensor corresponding to cyclic convolution
of degree 5; this is equivalent to multiplication of polynomials mod $p(u) = u^5 - 1$.
Over the field of rational numbers, the complete factorization of p(u) is $(u - 1) \times
(u^4 + u^3 + u^2 + u + 1)$ by exercise 4.6.2–32, so the tensor rank is 10 − 2 = 8. On
the other hand, the complete factorization over the real numbers, in terms of
the number $\phi = \frac12(1 + \sqrt5)$, is $(u - 1)(u^2 + \phi u + 1)(u^2 - \phi^{-1}u + 1)$; thus, the
rank is only 7, if we allow arbitrary real numbers to appear in A, B, C. Over
the complex numbers the rank is 5. This phenomenon does not occur in
two-dimensional tensors (matrices), where the rank can be determined by evaluating
determinants of submatrices and testing for 0. The rank of a matrix does not
change when the field containing its elements is embedded in a larger field, but
the rank of a tensor can decrease when the field gets larger.
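
The real factorization used in this example is easy to confirm in floating point, and the three ranks follow from the formula 2n − q of Theorem W. The sketch below is only a numerical illustration; `winograd_rank` is a hypothetical helper that restates the formula, and the comparison uses a small tolerance because φ is irrational.

```python
import math

def poly_mul(p, q):
    """Product of two polynomials given as coefficient lists, lowest degree first."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def winograd_rank(n, q):
    """Rank 2n - q for degree n when p(u) has q distinct irreducible factors."""
    return 2 * n - q

phi = (1 + math.sqrt(5)) / 2
real_factors = [[-1.0, 1.0],               # u - 1
                [1.0, phi, 1.0],           # u^2 + phi u + 1
                [1.0, -1.0 / phi, 1.0]]    # u^2 - u/phi + 1

product = [1.0]
for f in real_factors:
    product = poly_mul(product, f)

target = [-1.0, 0.0, 0.0, 0.0, 0.0, 1.0]   # u^5 - 1
print(all(abs(a - b) < 1e-9 for a, b in zip(product, target)))
print(winograd_rank(5, 2), winograd_rank(5, 3), winograd_rank(5, 5))
```

The three calls correspond to the rational, real, and complex cases, where $u^5 - 1$ has 2, 3, and 5 distinct irreducible factors respectively.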

In the paper that introduced Theorem W [Math. Systems Theory 10 (1977),
169–180], Winograd went on to show that all realizations of (71) in 2n − q
chain multiplications correspond to the use of (59), when q is greater than 1.
Furthermore he has shown that the only way to evaluate the coefficients of
x(u)y(u) in deg(x) + deg(y) + 1 chain multiplications is to use interpolation
or to use (58) with a polynomial that splits into distinct linear factors in the
field. Finally he has proved that the only way to evaluate x(u)y(u) mod p(u) in
2n − 1 chain multiplications when q = 1 is essentially to use (60). These results
hold for all polynomial chains, not only “normal” ones. He has extended the
results to multivariate polynomials in SICOMP 9 (1980), 225–229.

The tensor rank of an arbitrary m × n × 2 tensor in a suitably large field
has been determined by Joseph Ja’Ja’, SICOMP 8 (1979), 443–462; JACM 27
(1980), 822–830. See also his interesting discussion of commutative bilinear
forms in SICOMP 9 (1980), 713–728. However, the problem of computing the
tensor rank of an arbitrary n × n × n tensor over any finite field is NP-complete
[J. Håstad, Journal of Algorithms 11 (1990), 644–654].
For further reading. In this section we have barely scratched the surface of a
very large subject in which many beautiful theories are emerging. Considerably
more comprehensive treatments can be found in the books Computational
Complexity of Algebraic and Numeric Problems by A. Borodin and I. Munro (New
York: American Elsevier, 1975); Polynomial and Matrix Computations 1 by
D. Bini and V. Pan (Boston: Birkhäuser, 1994); Algebraic Complexity Theory by
P. Bürgisser, M. Clausen, and M. Amin Shokrollahi (Heidelberg: Springer, 1997).


EXERCISES 
1. [15] What is a good way to evaluate an “odd” polynomial

$$u(x) = u_{2n+1}x^{2n+1} + u_{2n-1}x^{2n-1} + \cdots + u_1x\,?$$

2. [M20] Instead of computing u(x + xo) by steps H1 and H2 as in the text, discuss 
the application of Horner’s rule (2) when polynomial multiplication and addition are 
used instead of arithmetic in the domain of coefficients. 

3. [20] Give a method analogous to Horner’s rule, for evaluating a polynomial in
two variables $\sum_{i+j\le n} u_{ij}x^iy^j$. (This polynomial has (n + 1)(n + 2)/2 coefficients, and
its “total degree” is n.) Count the number of additions and multiplications you use.


4. [M20] The text shows that scheme (3) is superior to Horner’s rule when we are 
evaluating a polynomial with real coefficients at a complex point z. Compare (3) to 
Horner’s rule when both the coefficients and the variable z are complex numbers; how 
many (real) multiplications and addition-subtractions are required by each method? 

5. [M15] Count the number of multiplications and additions required by the second- 
order rule (4). 

6. [22] (L. de Jong and J. van Leeuwen.) Show how to improve on steps S1, ..., S4
of the Shaw–Traub algorithm by computing only about $\frac12 n$ powers of $x_0$.

7. [M25] How can $\beta_0$, ..., $\beta_n$ be calculated so that (6) has the value $u(x_0 + kh)$ for
all integers k?

8. [M20] The factorial power $x^{\underline{k}}$ is defined to be $k!\binom{x}{k} = x(x - 1)\ldots(x - k + 1)$.
Explain how to evaluate $u_nx^{\underline{n}} + \cdots + u_1x^{\underline{1}} + u_0$ with at most n multiplications and
2n − 1 additions, starting with x and the n + 3 constants $u_n$, ..., $u_0$, 1, n − 1.


9. [M25] (H. J. Ryser.) Show that if $X = (x_{ij})$ is an n × n matrix, then

$$\mathrm{per}(X) = \sum (-1)^{n-\epsilon_1-\cdots-\epsilon_n} \prod_{1\le i\le n}\ \sum_{1\le j\le n} \epsilon_jx_{ij}$$

summed over all $2^n$ choices of $\epsilon_1$, ..., $\epsilon_n$ equal to 0 or 1 independently. Count the
number of addition and multiplication operations required to evaluate per(X) by this
formula.


10. [M21] The permanent of an n × n matrix $X = (x_{ij})$ may be calculated as follows:
Start with the n quantities $x_{11}$, $x_{12}$, ..., $x_{1n}$. For 1 ≤ k < n, assume that the $\binom{n}{k}$
quantities $A_{kS}$ have been computed, for all k-element subsets S of {1, 2, ..., n}, where
$A_{kS} = \sum x_{1j_1} \ldots x_{kj_k}$ summed over all k! permutations $j_1 \ldots j_k$ of the elements of S;
then form all of the sums

$$A_{(k+1)S} = \sum_{j\in S} A_{k(S\setminus\{j\})}\,x_{(k+1)j}.$$

We have $\mathrm{per}(X) = A_{n\{1,\ldots,n\}}$. How many additions and multiplications does this
method require? How much temporary storage is needed?


11. [M46] Is there any way to evaluate the permanent of a general n × n matrix using
fewer than $2^n$ arithmetic operations?


12. [M50] What is the minimum number of multiplications required to form the
product of two n × n matrices? What is the smallest exponent ω such that $O(n^{\omega+\epsilon})$
multiplications are sufficient for all ε > 0? (Find good upper and lower bounds for
small n as well as large n.)


13. [M23] Find the inverse of the general discrete Fourier transform (37), by
expressing $F(t_1, \ldots, t_n)$ in terms of the values of $f(s_1, \ldots, s_n)$. [Hint: See Eq. 1.2.9–(13).]
14. [HM28] (Fast Fourier transforms.) Show that the scheme (40) can be used to
evaluate the one-dimensional discrete Fourier transform

$$f(s) = \sum_{0\le t<2^n} F(t)\,\omega^{st},\qquad \omega = e^{2\pi i/2^n},\qquad 0 \le s < 2^n,$$

using arithmetic on complex numbers. Estimate the number of arithmetic operations
performed.


15. [HM28] The nth divided difference $f(x_0, x_1, \ldots, x_n)$ of a function f(x) at n + 1
distinct points $x_0$, $x_1$, ..., $x_n$ is defined by the formula

$$f(x_0, x_1, \ldots, x_n) = \bigl(f(x_0, x_1, \ldots, x_{n-1}) - f(x_1, \ldots, x_{n-1}, x_n)\bigr)/(x_0 - x_n),$$

for n > 0. Thus $f(x_0, x_1, \ldots, x_n) = \sum_{k=0}^{n} f(x_k)/\prod_{0\le j\le n,\ j\ne k}(x_k - x_j)$ is a symmetric
function of its n + 1 arguments. (a) Prove that $f(x_0, \ldots, x_n) = f^{(n)}(\theta)/n!$, for some θ
between $\min(x_0, \ldots, x_n)$ and $\max(x_0, \ldots, x_n)$, if the nth derivative $f^{(n)}(x)$ exists and
is continuous. [Hint: Prove the identity

$$
\begin{aligned}
f(x_0, x_1, \ldots, x_n) = \int_0^1 dt_1 \int_0^{t_1} dt_2 \ldots \int_0^{t_{n-1}} dt_n\; f^{(n)}\bigl(&x_0(1 - t_1) + x_1(t_1 - t_2) + \cdots\\
&+ x_{n-1}(t_{n-1} - t_n) + x_n(t_n - 0)\bigr).
\end{aligned}
$$

This formula also defines $f(x_0, x_1, \ldots, x_n)$ in a useful manner when the $x_j$ are not
distinct.] (b) If $y_j = f(x_j)$, show that $\alpha_j = f(x_0, \ldots, x_j)$ in Newton’s interpolation
polynomial (42).

16. [M22] How can we readily compute the coefficients of $u_{[n]}(x) = u_nx^n + \cdots + u_0$, if
we are given the values of $x_0$, $x_1$, ..., $x_{n-1}$, $\alpha_0$, $\alpha_1$, ..., $\alpha_n$ in Newton’s interpolation
polynomial (42)?

17. [M20] Show that the interpolation formula (45) reduces to a very simple
expression involving binomial coefficients when $x_k = x_0 + kh$ for 0 ≤ k ≤ n. [Hint: See
exercise 1.2.6–48.]


18. [M20] If the fourth-degree scheme (9) were changed to

$$y = (x + \alpha_0)x + \alpha_1,\qquad u(x) = \bigl((y - x + \alpha_2)y + \alpha_3\bigr)\alpha_4,$$

what formulas for computing the $\alpha_j$’s in terms of the $u_k$’s would take the place of (10)?


19. [M24] Explain how to determine the adapted coefficients $\alpha_0$, $\alpha_1$, ..., $\alpha_5$ in (11)
from the coefficients $u_5$, ..., $u_1$, $u_0$ of u(x), and find the α’s for the particular
polynomial $u(x) = x^5 + 5x^4 - 10x^3 - 50x^2 + 13x + 60$.

20. [21] Write a MIX program that evaluates a fifth-degree polynomial according to
scheme (11); try to make the program as efficient as possible, by making slight
modifications to (11). Use MIX’s floating point arithmetic operators FADD and FMUL, which
are described in Section 4.2.1.
21. [20] Find two additional ways to evaluate the polynomial $x^6 + 13x^5 + 49x^4 +
33x^3 - 61x^2 - 37x + 3$ by scheme (12), using the two roots of (15) that were not
considered in the text.


22. [18] What is the scheme for evaluating $x^6 - 3x^5 + x^4 - 2x^3 + x^2 - 3x - 1$, using
Pan’s method (16)?


23. [HM30] (J. Eve.) Let $f(z) = a_nz^n + a_{n-1}z^{n-1} + \cdots + a_0$ be a polynomial of
degree n with real coefficients, having at least n − 1 roots with a nonnegative real part.
Let

$$
\begin{aligned}
g(z) &= a_nz^n + a_{n-2}z^{n-2} + \cdots + a_{n \bmod 2}\,z^{n \bmod 2},\\
h(z) &= a_{n-1}z^{n-1} + a_{n-3}z^{n-3} + \cdots + a_{(n-1) \bmod 2}\,z^{(n-1) \bmod 2}.
\end{aligned}
$$

Assume that h(z) is not identically zero.

a) Show that g(z) has at least n − 2 imaginary roots (that is, roots whose real part is
zero), and h(z) has at least n − 3 imaginary roots. [Hint: Consider the number of
times the path f(z) circles the origin as z goes around the path shown in Fig. 16,
for a sufficiently large radius R.]

b) Prove that the squares of the roots of g(z) = 0 and h(z) = 0 are all real.

[Fig. 16. Proof of Eve’s theorem. Labeled points on the path: $iR$, $-iR$.]


> 24. [M24] Find values of c and $\alpha_k$, $\beta_k$ satisfying the conditions of Theorem E, for the
polynomial $u(x) = (x + 7)(x^2 + 6x + 10)(x^2 + 4x + 5)(x + 1)$. Choose these values so
that $\beta_2 = 0$. Give two different solutions.


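25. [M20] When the construction in the proof of Theorem M is applied to the
(inefficient) polynomial chain

$$
\begin{gathered}
\lambda_1 = \alpha_1 + \lambda_0,\quad \lambda_2 = -\lambda_0 - \lambda_0,\quad \lambda_3 = \lambda_1 + \lambda_1,\quad \lambda_4 = \alpha_2 \times \lambda_3,\\
\lambda_5 = \lambda_0 - \lambda_0,\quad \lambda_6 = \alpha_6 - \lambda_5,\quad \lambda_7 = \alpha_7 \times \lambda_6,\quad \lambda_8 = \lambda_7 \times \lambda_7,\\
\lambda_9 = \lambda_1 \times \lambda_4,\quad \lambda_{10} = \alpha_8 - \lambda_9,\quad \lambda_{11} = \lambda_3 - \lambda_{10},
\end{gathered}
$$

how can $\beta_1$, $\beta_2$, ..., $\beta_9$ be expressed in terms of $\alpha_1$, ..., $\alpha_8$?


> 26. [M21] (a) Give the polynomial chain corresponding to Horner’s rule for evaluating
polynomials of degree n = 3. (b) Using the construction that appears in the text’s proof
of Theorem A, express $\kappa_1$, $\kappa_2$, $\kappa_3$, and the result polynomial u(x) in terms of $\beta_1$, $\beta_2$,
$\beta_3$, $\beta_4$, and x. (c) Show that the result set obtained in (b), as $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$
independently assume all real values, omits certain vectors in the result set of (a).


27. [M22] Let R be a set that includes all (n + 1)-tuples $(q_n, \ldots, q_1, q_0)$ of real numbers
such that $q_n \ne 0$; prove that R does not have at most n degrees of freedom.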
28. [HM20] Show that if $f_0(\alpha_1, \ldots, \alpha_s)$, ..., $f_s(\alpha_1, \ldots, \alpha_s)$ are multivariate
polynomials with integer coefficients, then there is a nonzero polynomial $g(x_0, \ldots, x_s)$ with
integer coefficients such that $g(f_0(\alpha_1, \ldots, \alpha_s), \ldots, f_s(\alpha_1, \ldots, \alpha_s)) = 0$ for all real $\alpha_1$,
..., $\alpha_s$. (Hence any polynomial chain with s parameters has at most s degrees of
freedom.) [Hint: Use the theorems about “algebraic dependence” that are found, for
example, in B. L. van der Waerden’s Modern Algebra, translated by Fred Blum (New
York: Ungar, 1949), Section 64.]


29. [M20] Let $R_1$, $R_2$, ..., $R_m$ all be sets of (n + 1)-tuples of real numbers having at
most t degrees of freedom. Show that the union $R_1 \cup R_2 \cup \cdots \cup R_m$ also has at most t
degrees of freedom.


30. [M28] Prove that a polynomial chain with $m_c$ chain multiplications and $m_p$
parameter multiplications has at most $2m_c + m_p + \delta_{0m_c}$ degrees of freedom. [Hint:
Generalize Theorem M, showing that the first chain multiplication and each parameter
multiplication can essentially introduce only one new parameter into the result set.]


31. [M23] Prove that a polynomial chain capable of computing all monic polynomials
of degree n has at least $\lfloor n/2\rfloor$ multiplications and at least n addition-subtractions.


32. [M24] Find a polynomial chain of minimum possible length that can compute all
polynomials of the form $u_4x^4 + u_2x^2 + u_0$; and prove that its length is minimal.


33. [M25] Let n ≥ 3 be odd. Prove that a polynomial chain with $\lfloor n/2\rfloor + 1$
multiplication steps cannot compute all polynomials of degree n unless it has at least n + 2
addition-subtraction steps. [Hint: See exercise 30.]


34. [M26] Let $\lambda_0$, $\lambda_1$, ..., $\lambda_r$ be a polynomial chain in which all of the addition
and subtraction steps are parameter steps, and in which there is at least one parameter
multiplication. Assume that this scheme has m multiplications and k = r − m
addition-subtractions, and that the polynomial computed by the chain has maximum degree n.
Prove that all polynomials computable by this chain, for which the coefficient of $x^n$ is
not zero, can be computed by another chain that has at most m multiplications and
at most k additions, and no subtractions; furthermore the last step of the new chain
should be the only parameter multiplication.


35. [M25] Show that any polynomial chain that computes a general fourth-degree
polynomial using three multiplications must have at least five addition-subtractions.
[Hint: Assume that there are only four addition-subtractions, and show that
exercise 34 applies; therefore the scheme must have a particular form that is incapable of
representing all fourth-degree polynomials.]


36. [M27] Continuing the previous exercise, show that any polynomial chain that 
computes a general sixth-degree polynomial using only four multiplications must have 
at least seven addition-subtractions. 


37. [M21] (T. S. Motzkin.) Show that “almost all” rational functions of the form

$$(u_nx^n + u_{n-1}x^{n-1} + \cdots + u_1x + u_0)/(x^n + v_{n-1}x^{n-1} + \cdots + v_1x + v_0),$$

with coefficients in a field S, can be evaluated using the scheme

$$\alpha_1 + \beta_1/(x + \alpha_2 + \beta_2/(x + \cdots + \beta_n/(x + \alpha_{n+1})\cdots)),$$

for suitable $\alpha_j$, $\beta_j$ in S. (This continued fraction scheme has n divisions and 2n
additions; by “almost all” rational functions we mean all except those whose coefficients
satisfy some nontrivial polynomial equation.) Determine the α’s and β’s for the rational
function $(x^2 + 10x + 29)/(x^2 + 8x + 19)$.
> 38. [HM32] (V. Y. Pan, 1962.) The purpose of this exercise is to prove that Horner’s
rule is really optimal if no preliminary adaptation of coefficients is made; we need n
multiplications and n additions to compute $u_nx^n + \cdots + u_1x + u_0$, if the variables $u_n$,
..., $u_1$, $u_0$, x, and arbitrary constants are given. Consider chains that are as before
except that $u_n$, ..., $u_1$, $u_0$, x are each considered to be variables; we may say, for
example, that $\lambda_{-j-1} = u_j$, $\lambda_0 = x$. In order to show that Horner’s rule is best, it is
convenient to prove a somewhat more general theorem: Let $A = (a_{ij})$, 0 ≤ i ≤ m,
0 ≤ j ≤ n, be an (m + 1) × (n + 1) matrix of real numbers, of rank n + 1; and let
$B = (b_0, \ldots, b_m)$ be a vector of real numbers. Prove that any polynomial chain that
computes

$$P(x; u_0, \ldots, u_n) = \sum_{i=0}^{m} (a_{i0}u_0 + \cdots + a_{in}u_n + b_i)\,x^i$$

involves at least n chain multiplications. (Note that this does not mean only that
we are considering some fixed chain in which the parameters $\alpha_j$ are assigned values
depending on A and B; it means that both the chain and the values of the α’s may
depend on the given matrix A and vector B. No matter how A, B, and the values
of $\alpha_j$ are chosen, it is impossible to compute $P(x; u_0, \ldots, u_n)$ without doing n
“chain-step” multiplications.) The assumption that A has rank n + 1 implies that m ≥ n.
[Hint: Show that from any such scheme we can derive another that has fewer chain
multiplications and that has n decreased by one.]


39. [M29] (T. S. Motzkin, 1954.) Show that schemes of the form 


$$w_1 = x(x + \alpha_1) + \beta_1, \qquad w_k = w_{k-1}(w_1 + \gamma_k x + \alpha_k) + \delta_k x + \beta_k \quad\text{for } 1 < k \le m,$$

where the $\alpha_k$, $\beta_k$ are real and the $\gamma_k$, $\delta_k$ are integers, can be used to evaluate all monic
polynomials of degree 2m over the real numbers. (We may have to choose $\alpha_k$, $\beta_k$, $\gamma_k$,
and $\delta_k$ differently for different polynomials.) Try to let $\delta_k = 0$ whenever possible.


40. [M41] Can the lower bound in the number of multiplications in Theorem C be
raised from $\lfloor n/2 \rfloor + 1$ to $\lceil n/2 \rceil + 1$? (See exercise 33.)


41. [22] Show that the real and imaginary parts of (a + bi)(c + di) can be obtained 
by doing 3 multiplications and 5 additions of real numbers, where two of the additions 
involve a and b only. 


42. [36] (M. Paterson and L. Stockmeyer.) (a) Prove that a polynomial chain with
m ≥ 2 chain multiplications has at most $m^2 + 1$ degrees of freedom. (b) Show that for
all n ≥ 2 there exist polynomials of degree n, all of whose coefficients are 0 or 1, that
cannot be evaluated by any polynomial chain with fewer than $\lfloor\sqrt{n}\rfloor$ multiplications, if
we require all parameters $\alpha_j$ to be integers. (c) Show that any polynomial of degree n
with integer coefficients can be evaluated by an all-integer algorithm that performs at
most $2\lfloor\sqrt{n}\rfloor$ multiplications, if we don't care how many additions we do.


43. [22] Explain how to evaluate $x^n + \cdots + x + 1$ with $2l(n+1) - 2$ multiplications
and $l(n+1)$ additions (no divisions or subtractions), where $l(n)$ is the function studied
in Section 4.6.3.


> 44. [M25] Show that any monic polynomial $u(x) = x^n + u_{n-1}x^{n-1} + \cdots + u_0$ can be
evaluated with $\frac12 n + O(\log n)$ multiplications and $\le \frac54 n$ additions, using parameters
$\alpha_1$, $\alpha_2$, ... that are polynomials in $u_{n-1}$, $u_{n-2}$, ... with integer coefficients. [Hint:
Consider first the case $n = 2^l$.]
> 45. [HM22] Let $(t_{ijk})$ be an m × n × s tensor, and let F, G, H be nonsingular matrices
of respective sizes m × m, n × n, s × s. If

$$T_{ijk} = \sum_{i'=1}^{m} \sum_{j'=1}^{n} \sum_{k'=1}^{s} F_{ii'}\,G_{jj'}\,H_{kk'}\,t_{i'j'k'}$$

for all i, j, k, prove that the tensor $(T_{ijk})$ has the same rank as $(t_{ijk})$. [Hint: Consider
what happens when $F^{-1}$, $G^{-1}$, $H^{-1}$ are applied in the same way to $(T_{ijk})$.]


46. [M28] Prove that all pairs $(z_1, z_2)$ of bilinear forms in $(x_1, x_2)$ and $(y_1, y_2)$ can be
evaluated with at most three chain multiplications. In other words, show that every
2 × 2 × 2 tensor has rank ≤ 3.


47. [M25] Prove that for all m, n, and s there exists an m × n × s tensor whose rank
is at least $\lceil mns/(m + n + s) \rceil$. Conversely, show that every m × n × s tensor has rank
at most $mns/\max(m, n, s)$.


48. [M21] If $(t_{ijk})$ and $(t'_{ijk})$ are tensors of sizes m × n × s and m' × n' × s', respectively,
their direct sum $(t_{ijk}) \oplus (t'_{ijk}) = (t''_{ijk})$ is the (m + m') × (n + n') × (s + s') tensor
defined by $t''_{ijk} = t_{ijk}$ if i ≤ m, j ≤ n, k ≤ s; $t''_{ijk} = t'_{i-m,\,j-n,\,k-s}$ if i > m, j > n,
k > s; and $t''_{ijk} = 0$ otherwise. Their direct product $(t_{ijk}) \otimes (t'_{ijk}) = (t'''_{ijk})$ is the
mm' × nn' × ss' tensor defined by $t'''_{\langle ii'\rangle\langle jj'\rangle\langle kk'\rangle} = t_{ijk}\,t'_{i'j'k'}$. Derive the upper bounds
$\mathrm{rank}(t''_{ijk}) \le \mathrm{rank}(t_{ijk}) + \mathrm{rank}(t'_{ijk})$ and $\mathrm{rank}(t'''_{ijk}) \le \mathrm{rank}(t_{ijk}) \cdot \mathrm{rank}(t'_{ijk})$.

> 49. [HM25] Show that the rank of an m × n × 1 tensor $(t_{ijk})$ is the same as its rank
as an m × n matrix $(t_{ij1})$, according to the traditional definition of matrix rank as the
maximum number of linearly independent rows.


50. [HM20] (S. Winograd.) Let (tijk) be the mn x n x m tensor corresponding to 
multiplication of an m x n matrix by an n x 1 column vector. Prove that the rank of 
(tijk) is mn. 


> 51. [M24] (S. Winograd.) Devise an algorithm for cyclic convolution of degree 2 that 
uses 2 multiplications and 4 additions, not counting operations on the xi. Similarly, 
devise an algorithm for degree 3, using 4 multiplications and 11 additions. (See (69), 
which solves the analogous problem for degree 4.) 


52. [M25] (S. Winograd.) Let n = n'n'' where n' ⊥ n''. Given normal schemes for
cyclic convolutions of degrees n' and n'', using respectively (m', m'') chain multiplications,
(p', p'') parameter multiplications, and (a', a'') additions, show how to construct
a normal scheme for cyclic convolution of degree n using m'm'' chain multiplications,
p'n'' + m'p'' parameter multiplications, and a'n'' + m'a'' additions.


53. [HM40] (S. Winograd.) Let ω be a complex mth root of unity, and consider the
one-dimensional discrete Fourier transform

$$f(s) = \sum_{t=1}^{m} F(t)\,\omega^{st}, \qquad\text{for } 1 \le s \le m.$$

a) When $m = p^e$ is a power of an odd prime, show that efficient normal schemes
for computing cyclic convolutions of degrees $(p-1)p^k$, for 0 ≤ k < e, will lead to
efficient algorithms for computing the Fourier transform on m complex numbers.
Give a similar construction for the case p = 2.

b) When m = m'm'' and m' ⊥ m'', show that Fourier transformation algorithms
for m' and m'' can be combined to yield a Fourier transformation algorithm for
m elements.
54. [M23] Theorem W refers to an infinite field. How many elements must a finite
field have in order for the proof of Theorem W to be valid? 


55. [HM22] Determine the rank of tensor (74) when P is an arbitrary n x n matrix. 


56. [M32] (V. Strassen.) Show that any polynomial chain that evaluates a set of
quadratic forms $\sum_{i=1}^{n}\sum_{j=1}^{n} \tau_{ijk}x_ix_j$ for 1 ≤ k ≤ s must use at least $\frac12\,\mathrm{rank}(\tau_{ijk} + \tau_{jik})$
chain multiplications altogether. [Hint: Show that the minimum number of chain
multiplications is the minimum rank of $(t_{ijk})$ taken over all tensors $(t_{ijk})$ such that
$t_{ijk} + t_{jik} = \tau_{ijk} + \tau_{jik}$ for all i, j, k.] Conclude that if a polynomial chain evaluates a
set of bilinear forms (47) corresponding to a tensor $(t_{ijk})$, whether normal or abnormal,
it must use at least $\frac12\,\mathrm{rank}(t_{ijk})$ chain multiplications.


57. [M20] Show that fast Fourier transforms can be used to compute the coefficients of 
the product z(u)y(u) of two given polynomials of degree n, using O(n log n) operations 
of (exact) addition and multiplication of complex numbers. [Hint: Consider the product 
of Fourier transforms of the coefficients. ] 


58. [HM28] (a) Show that any realization (A,B,C) of the polynomial multiplication 
tensor (55) must have the following property: Any nonzero linear combination of the 
three rows of A must be a vector with at least four nonzero elements; and any nonzero 
linear combination of the four rows of B must have at least three nonzero elements. 
(b) Find a realization (A, B,C) of (55) that uses only 0, +1, and —1 as elements, where 
r = 8. Try to use as many Os as possible. 


59. [M40] (H. J. Nussbaumer, 1980.) The text defines the cyclic convolution of two
sequences $(x_0, x_1, \ldots, x_{n-1})$ and $(y_0, y_1, \ldots, y_{n-1})$ to be the sequence $(z_0, z_1, \ldots, z_{n-1})$
where $z_k = x_0y_k + \cdots + x_ky_0 + x_{k+1}y_{n-1} + \cdots + x_{n-1}y_{k+1}$. Let us define the negacyclic
convolution similarly, but with

$$z_k = x_0y_k + \cdots + x_ky_0 - (x_{k+1}y_{n-1} + \cdots + x_{n-1}y_{k+1}).$$


Construct efficient algorithms for cyclic and negacyclic convolution over the integers 
when n is a power of 2. Your algorithms should deal entirely with integers, and 
they should perform at most O(n log n) multiplications and at most O(n log n log log n) 
additions or subtractions or divisions of even numbers by 2. [Hint: A cyclic convolution 
of order 2n can be reduced to cyclic and negacyclic convolutions of order n, using (59).] 


60. [M27] (V. Y. Pan.) The problem of (m × n) times (n × s) matrix multiplication
corresponds to an mn × ns × sm tensor $(t_{\langle i,j'\rangle\langle j,k'\rangle\langle k,i'\rangle})$ where $t_{\langle i,j'\rangle\langle j,k'\rangle\langle k,i'\rangle} = 1$ if
and only if i' = i and j' = j and k' = k. The rank of this tensor T(m, n, s) is the
smallest number r such that numbers $a_{ij'l}$, $b_{jk'l}$, $c_{ki'l}$ exist satisfying

$$\sum_{\substack{1\le i\le m\\ 1\le j\le n\\ 1\le k\le s}} x_{ij}\,y_{jk}\,z_{ki} = \sum_{1\le l\le r} \Bigl(\sum_{\substack{1\le i\le m\\ 1\le j'\le n}} a_{ij'l}\,x_{ij'}\Bigr) \Bigl(\sum_{\substack{1\le j\le n\\ 1\le k'\le s}} b_{jk'l}\,y_{jk'}\Bigr) \Bigl(\sum_{\substack{1\le k\le s\\ 1\le i'\le m}} c_{ki'l}\,z_{ki'}\Bigr).$$

Let M(n) be the rank of T(n, n, n). The purpose of this exercise is to exploit the
symmetry of such a trilinear representation, obtaining efficient realizations of matrix
multiplication over the integers when m = n = s = 2ν. For convenience we divide
the indices {1, ..., n} into two subsets O = {1, 3, ..., n − 1} and E = {2, 4, ..., n} of
ν elements each, and we set up a one-to-one correspondence between O and E by the
rule $\tilde\imath = i + 1$ if i ∈ O; $\tilde\imath = i - 1$ if i ∈ E. Thus we have $\tilde{\tilde\imath} = i$ for all indices i.
a) The identity

$$abc + ABC = (a + A)(b + B)(c + C) - (a + A)bC - A(b + B)c - aB(c + C)$$

implies that

$$\sum_{1\le i,j,k\le n} x_{ij}\,y_{jk}\,z_{ki} = \sum_{(i,j,k)\in S} (x_{ij} + x_{\tilde k\tilde\imath})(y_{jk} + y_{\tilde\imath\tilde\jmath})(z_{ki} + z_{\tilde\jmath\tilde k}) - \Sigma_1 - \Sigma_2 - \Sigma_3,$$

where S = E×E×E ∪ E×E×O ∪ E×O×E ∪ O×E×E is the set of all triples of
indices containing at most one odd index; $\Sigma_1$ is the sum of all terms of the form
$(x_{ij} + x_{\tilde k\tilde\imath})\,y_{jk}\,z_{\tilde\jmath\tilde k}$ for (i, j, k) ∈ S; and $\Sigma_2$, $\Sigma_3$ similarly are sums of the terms
$x_{\tilde k\tilde\imath}(y_{jk} + y_{\tilde\imath\tilde\jmath})\,z_{ki}$, $x_{ij}\,y_{\tilde\imath\tilde\jmath}(z_{ki} + z_{\tilde\jmath\tilde k})$. Clearly S has $4\nu^3 = \frac12 n^3$ terms. Show that
each of $\Sigma_1$, $\Sigma_2$, $\Sigma_3$ can be realized as the sum of $3\nu^2$ trilinear terms; furthermore,
if the 3ν triples of the forms $(i, i, \tilde\imath)$ and $(i, \tilde\imath, i)$ and $(\tilde\imath, i, i)$ are removed from S, we
can modify $\Sigma_1$, $\Sigma_2$, and $\Sigma_3$ in such a way that the identity is still valid, without
adding any new trilinear terms. Thus $M(n) \le \frac12 n^3 + \frac94 n^2 - \frac32 n$ when n is even.
b) Apply the method of (a) to show that two independent matrix multiplication
problems of the respective sizes m × n × s and s × m × n can be performed with
mns + mn + ns + sm noncommutative multiplications.


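61. [M26] Let $(t_{ijk})$ be a tensor over an arbitrary field. We define $\mathrm{rank}_d(t_{ijk})$ as the
minimum value of r such that there is a realization of the form

$$\sum_{l=1}^{r} a_{il}(u)\,b_{jl}(u)\,c_{kl}(u) = t_{ijk}\,u^d + O(u^{d+1}),$$

where $a_{il}(u)$, $b_{jl}(u)$, $c_{kl}(u)$ are polynomials in u over the field. Thus $\mathrm{rank}_0$ is the
ordinary rank of a tensor. Prove that
a) $\mathrm{rank}_{d+1}(t_{ijk}) \le \mathrm{rank}_d(t_{ijk})$;
b) $\mathrm{rank}(t_{ijk}) \le \binom{d+2}{2}\,\mathrm{rank}_d(t_{ijk})$;
c) $\mathrm{rank}_d((t_{ijk}) \oplus (t'_{ijk})) \le \mathrm{rank}_d(t_{ijk}) + \mathrm{rank}_d(t'_{ijk})$, in the sense of exercise 48;
d) $\mathrm{rank}_{d+d'}((t_{ijk}) \otimes (t'_{ijk})) \le \mathrm{rank}_d(t_{ijk}) \cdot \mathrm{rank}_{d'}(t'_{ijk})$;
e) $\mathrm{rank}_{d+d'}((t_{ijk}) \otimes (t'_{ijk})) \le \mathrm{rank}_{d'}(r(t'_{ijk}))$, where $r = \mathrm{rank}_d(t_{ijk})$ and rT denotes
the direct sum T ⊕ ··· ⊕ T of r copies of T.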
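62. [M24] The border rank of $(t_{ijk})$, denoted by $\underline{\mathrm{rank}}(t_{ijk})$, is $\min_{d\ge0} \mathrm{rank}_d(t_{ijk})$,
where $\mathrm{rank}_d$ is defined in exercise 61. Prove that the tensor $\begin{pmatrix}1&0\\0&1\end{pmatrix}\begin{pmatrix}0&1\\0&0\end{pmatrix}$ has rank 3 but
border rank 2, over every field.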
63. [HM30] Let T(m, n, s) be the tensor for matrix multiplication as in exercise 60,
and let M(N) be the rank of T(N, N, N).
a) Show that $T(m, n, s) \otimes T(M, N, S) = T(mM, nN, sS)$.
b) Show that $\mathrm{rank}_d(T(mN, nN, sN)) \le \mathrm{rank}_d(M(N)\,T(m, n, s))$ (see exercise 61(e)).
c) If T(m, n, s) has rank ≤ r, show that $M(N) = O(N^{\omega(m,n,s,r)})$ as N → ∞, where
$\omega(m, n, s, r) = 3\log r/\log mns$.
d) If T(m, n, s) has border rank ≤ r, show that $M(N) = O(N^{\omega(m,n,s,r)}(\log N)^2)$.
64. [M30] (A. Schönhage.) Show that $\mathrm{rank}_2(T(3, 3, 3)) \le 21$, so $M(N) = O(N^{2.78})$.
> 65. [M27] (A. Schönhage.) Show that $\mathrm{rank}_2(T(m, 1, n) \oplus T(1, (m-1)(n-1), 1)) = mn + 1$.
Hint: Consider the trilinear form

$$\sum_{i=1}^{m} \sum_{j=1}^{n} (x_i + uX_{ij})(y_j + uY_{ij})(Z + u^2 z_{ij}) - (x_1 + \cdots + x_m)(y_1 + \cdots + y_n)Z$$

when $\sum_{i=1}^{m} X_{ij} = \sum_{j=1}^{n} Y_{ij} = 0$.
66. [HM33] We can now use the result of exercise 65 to sharpen the asymptotic bounds
of exercise 63.

a) Prove that the limit $\omega = \lim_{n\to\infty} \log M(n)/\log n$ exists.

b) Prove that $(mns)^{\omega/3} \le \underline{\mathrm{rank}}(T(m, n, s))$.

c) Let t be the tensor $T(m, n, s) \oplus T(M, N, S)$. Prove that $(mns)^{\omega/3} + (MNS)^{\omega/3} \le \underline{\mathrm{rank}}(t)$.
Hint: Consider direct products of t with itself.

d) Therefore $16^{\omega/3} + 9^{\omega/3} \le 17$, and we have ω < 2.55.


67. [HM40] (D. Coppersmith and S. Winograd.) By generalizing exercises 65 and 66
we can obtain even better upper bounds on ω.

a) Say that the tensor $(t_{ijk})$ is nondegenerate if $\mathrm{rank}(t_{i(jk)}) = m$, $\mathrm{rank}(t_{j(ki)}) = n$,
and $\mathrm{rank}(t_{k(ij)}) = s$, in the notation of Lemma T. Prove that the tensor T(m, n, s)
for mn × ns matrix multiplication is nondegenerate.

b) Show that the direct sum of nondegenerate tensors is nondegenerate.

c) An m × n × s tensor t with realization (A, B, C) of length r is called improvable
if it is nondegenerate and there are nonzero elements $d_1$, ..., $d_r$ such that
$\sum_{l=1}^{r} a_{il}b_{jl}d_l = 0$ for 1 ≤ i ≤ m and 1 ≤ j ≤ n. Prove that in such a
case $t \oplus T(1, q, 1)$ has border rank ≤ r, where q = r − m − n. Hint: There
are q × r matrices V and W such that $\sum_{l=1}^{r} v_{il}b_{jl}d_l = \sum_{l=1}^{r} a_{il}w_{jl}d_l = 0$ and
$\sum_{l=1}^{r} v_{il}w_{jl}d_l = \delta_{ij}$ for all relevant i and j.

d) Explain why the result of exercise 65 is a special case of (c).

e) Prove that $\mathrm{rank}(T(m, n, s)) \le r$ implies

$$\mathrm{rank}_2(T(m, n, s) \oplus T(1, r - n(m + s - 1), 1)) \le r + n.$$

f) Therefore ω is strictly less than $\log M(n)/\log n$ for all n > 1.

g) Generalize (c) to the case where (A, B, C) realizes t only in the weaker sense of
exercise 61.

h) From (d) we have $\underline{\mathrm{rank}}(T(3, 1, 3) \oplus T(1, 4, 1)) \le 10$; thus by exercise 61(d) we also
have $\underline{\mathrm{rank}}(T(9, 1, 9) \oplus 2T(3, 4, 3) \oplus T(1, 16, 1)) \le 100$. Prove that if we simply
delete the rows of A and B that correspond to the 16 + 16 variables of T(1, 16, 1),
we obtain a realization of $T(9, 1, 9) \oplus 2T(3, 4, 3)$ that is improvable. Therefore we
have in fact $\underline{\mathrm{rank}}(T(9, 1, 9) \oplus 2T(3, 4, 3) \oplus T(1, 34, 1)) \le 100$.

i) Generalizing exercise 66(c), show that

$$\sum_{p=1}^{t} (m_pn_ps_p)^{\omega/3} \le \underline{\mathrm{rank}}\Bigl(\bigoplus_{p=1}^{t} T(m_p, n_p, s_p)\Bigr).$$

j) Therefore ω < 2.5.
68. [M45] Is there a way to evaluate the polynomial

$$\sum_{1\le i<j\le n} x_ix_j = x_1x_2 + \cdots + x_{n-1}x_n$$

with fewer than n − 1 multiplications and 2n − 4 additions? (There are $\binom{n}{2}$ terms.)


> 69. [HM27] (V. Strassen, 1973.) Show that the determinant (31) of an n × n matrix
can be evaluated by doing $O(n^5)$ multiplications and $O(n^5)$ additions or subtractions,
and no divisions. [Hint: Consider det(I + Y) where Y = X − I.]
> 70. [HM25] The characteristic polynomial $f_X(\lambda)$ of a matrix X is defined to be
$\det(\lambda I - X)$. Prove that if $X = \begin{pmatrix} x & u \\ v & Y \end{pmatrix}$, where x, u, v, and Y are respectively of
sizes 1 × 1, 1 × (n − 1), (n − 1) × 1, and (n − 1) × (n − 1), we have

$$f_X(\lambda) = f_Y(\lambda)\Bigl(\lambda - x - \frac{uv}{\lambda} - \frac{uYv}{\lambda^2} - \frac{uY^2v}{\lambda^3} - \cdots\Bigr).$$

Show that this relation allows us to compute the coefficients of $f_X$ with about $\frac14 n^4$
multiplications, $\frac14 n^4$ addition-subtractions, and no divisions. Hint: Use the identity

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0 & D \end{pmatrix} \begin{pmatrix} I & B \\ 0 & I \end{pmatrix} \begin{pmatrix} A - BD^{-1}C & 0 \\ D^{-1}C & I \end{pmatrix},$$

which holds for any matrices A, B, C, and D of respective sizes l × l, l × m, m × l, and
m × m when D is nonsingular.


> 71. [HM30] A quolynomial chain is like a polynomial chain except that it allows
division as well as addition, subtraction, and multiplication. Prove that if $f(x_1, \ldots, x_n)$
can be computed by a quolynomial chain that has m chain multiplications and d divisions,
then $f(x_1, \ldots, x_n)$ and all n of its partial derivatives $\partial f(x_1, \ldots, x_n)/\partial x_k$ for
1 ≤ k ≤ n can be computed by a single quolynomial chain that has at most 3m + d chain
multiplications and 2d divisions. (Consequently, for example, any efficient method for
calculating the determinant of a matrix leads to an efficient method for calculating all
of its cofactors, hence an efficient method for computing the inverse matrix.)


72. [M48] Is it possible to determine the rank of any given tensor (tijk) over, say, the 
field of rational numbers, in a finite number of steps? 


73. [HM25] (J. Morgenstern, 1973.) Prove that any polynomial chain for the discrete
Fourier transform (37) has at least $\frac12 m_1 \ldots m_n \lg m_1 \ldots m_n$ addition-subtractions, if
there are no chain multiplications and if every parameter multiplication is by a complex-valued
constant with $|\alpha_j| \le 1$. Hint: Consider the matrices of the linear transformations
computed by the first k steps. How big can their determinants be?


74. [HM35] (A. Nozaki, 1978.) Most of the theory of polynomial evaluation is concerned
with bounds on chain multiplications, but multiplication by noninteger constants
can also be essential. The purpose of this exercise is to develop an appropriate theory
of constants. Let us say that vectors $v_1$, ..., $v_s$ of real numbers are Z-dependent if
there are integers $(k_1, \ldots, k_s)$ such that $\gcd(k_1, \ldots, k_s) = 1$ and $k_1v_1 + \cdots + k_sv_s$ is an
all-integer vector. If no such $(k_1, \ldots, k_s)$ exist, the vectors $v_1$, ..., $v_s$ are Z-independent.
a) Prove that if the columns of an r × s matrix V are Z-independent, so are the
columns of VU, when U is any s × s unimodular matrix (a matrix of integers
whose determinant is ±1).
b) Let V be an r × s matrix with Z-independent columns. Prove that a polynomial
chain to evaluate the elements of Vx from inputs $x_1$, ..., $x_s$, where
$x = (x_1, \ldots, x_s)^T$, needs at least s multiplications.

c) Let V be an r × t matrix having s columns that are Z-independent. Prove that
a polynomial chain to evaluate the elements of Vx from inputs $x_1$, ..., $x_t$, where
$x = (x_1, \ldots, x_t)^T$, needs at least s multiplications.

d) Show how to compute the pair of values {x/2 + y, x + y/3} from x and y using
only one multiplication, although two multiplications are needed to compute the
pair {x/2 + y, x + y/2}.
*4.7. MANIPULATION OF POWER SERIES


IF WE ARE GIVEN two power series 


$$U(z) = U_0 + U_1z + U_2z^2 + \cdots, \qquad V(z) = V_0 + V_1z + V_2z^2 + \cdots, \tag{1}$$


whose coefficients belong to a field, we can form their sum, their product, and 
sometimes their quotient, to obtain new power series. A polynomial is obviously 
a special case of a power series, in which there are only finitely many terms. 

Of course, only a finite number of terms can be represented and stored 
within a computer, so it makes sense to ask whether power series arithmetic 
is even possible on computers; and if it is possible, what makes it different 
from polynomial arithmetic? The answer is that we work with only the first N 
coefficients of the power series, where N is a parameter that may in principle be 
arbitrarily large; instead of ordinary polynomial arithmetic, we are essentially 
doing polynomial arithmetic modulo $z^N$, and this often leads to a somewhat
different point of view. Furthermore, special operations like “reversion” can be 
performed on power series but not on polynomials, since polynomials are not 
closed under those operations. 

Manipulation of power series has many applications to numerical analysis, 
but perhaps its greatest use is the determination of asymptotic expansions (as we 
have seen in Section 1.2.11.3), or the calculation of quantities defined by certain 
generating functions. The latter applications make it desirable to calculate 
the coefficients exactly, instead of with floating point arithmetic. All of the 
algorithms in this section, with obvious exceptions, can be done using rational 
operations only, so the techniques of Section 4.5.1 can be used to obtain exact 
results when desired. 

The calculation of W(z) = U(z) + V(z) is, of course, trivial, since we have
$W_n = [z^n]\,W(z) = U_n + V_n$ for n = 0, 1, 2, .... It is also easy to calculate the
coefficients of W(z) = U(z)V(z), using the familiar convolution rule

$$W_n = \sum_{k=0}^{n} U_kV_{n-k} = U_0V_n + U_1V_{n-1} + \cdots + U_nV_0. \tag{2}$$

The quotient W(z) = U(z)/V(z), when $V_0 \ne 0$, can be obtained by interchanging
U and W in (2); we obtain the rule

$$W_n = \Bigl(U_n - \sum_{k=0}^{n-1} W_kV_{n-k}\Bigr)\Big/V_0 = (U_n - W_0V_n - W_1V_{n-1} - \cdots - W_{n-1}V_1)/V_0. \tag{3}$$

This recurrence relation for the W's makes it easy to determine $W_0$, $W_1$, $W_2$, ...
successively, without inputting $U_n$ and $V_n$ until after $W_{n-1}$ has been computed.
A power series manipulation algorithm with that property is traditionally called
online; with an online algorithm, we can determine N coefficients $W_0$, $W_1$, ...,
$W_{N-1}$ of the result without knowing N in advance, so we could in principle run
the algorithm indefinitely and compute the entire power series. We can also run
an online algorithm until any desired condition is met. (The opposite of "online"
is "offline.")

If the coefficients $U_k$ and $V_k$ are integers but the $W_k$ are not, the recurrence
relation (3) involves computation with fractions. This can be avoided by the 
all-integer approach described in exercise 2. 
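
As an illustration of rules (2) and (3), here is a minimal Python sketch that works with the first N coefficients only. The function names `series_mul` and `series_div` and the list-of-coefficients representation are assumptions made for this example; exact rational arithmetic is obtained with the standard `fractions` module rather than with the all-integer method of exercise 2.

```python
from fractions import Fraction

def series_mul(U, V, N):
    """First N coefficients of U(z)V(z), using the convolution rule (2).

    U and V are lists holding at least N coefficients each.
    """
    return [sum(U[k] * V[n - k] for k in range(n + 1)) for n in range(N)]

def series_div(U, V, N):
    """First N coefficients of U(z)/V(z), using recurrence (3).

    Requires V[0] != 0.  W_n depends only on U_0..U_n and V_0..V_n,
    which is what makes the method online.
    """
    W = []
    for n in range(N):
        acc = Fraction(U[n]) - sum(W[k] * V[n - k] for k in range(n))
        W.append(acc / V[0])
    return W
```

For instance, `series_div([1, 0, 0, 0], [1, -1, 0, 0], 4)` produces the first four coefficients of 1/(1 − z).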

Let us now consider the operation of computing $W(z) = V(z)^\alpha$, where α is
an "arbitrary" power. For example, we could calculate the square root of V(z)
by taking $\alpha = \frac12$, or we could find $V(z)^{-10}$ or even $V(z)^\pi$. If $V_m$ is the first
nonzero coefficient of V(z), we have

$$\begin{aligned} V(z) &= V_m z^m \bigl(1 + (V_{m+1}/V_m)z + (V_{m+2}/V_m)z^2 + \cdots\bigr),\\ V(z)^\alpha &= V_m^\alpha z^{\alpha m} \bigl(1 + (V_{m+1}/V_m)z + (V_{m+2}/V_m)z^2 + \cdots\bigr)^\alpha. \end{aligned} \tag{4}$$

This will be a power series if and only if αm is a nonnegative integer. If α itself
is not an integer, there's more than one possibility for $V_m^\alpha z^{\alpha m}$ here.

From (4) we can see that the problem of computing general powers can be
reduced to the case that $V_0 = 1$; then the problem is to compute the coefficients of

$$W(z) = (1 + V_1z + V_2z^2 + V_3z^3 + \cdots)^\alpha. \tag{5}$$

Clearly $W_0 = 1^\alpha = 1$.

The obvious way to find the coefficients of (5) is to use the binomial theorem,
Eq. 1.2.9-(19), or (if α is a positive integer) to try repeated squaring as in Section
4.6.3. But Leonhard Euler discovered a much simpler and more efficient way to
obtain power series powers [Introductio in Analysin Infinitorum 1 (1748), §76]:
If $W(z) = V(z)^\alpha$, we have by differentiation

$$W_1 + 2W_2z + 3W_3z^2 + \cdots = W'(z) = \alpha V(z)^{\alpha-1}V'(z); \tag{6}$$

therefore

$$W'(z)V(z) = \alpha W(z)V'(z). \tag{7}$$

If we now equate the coefficients of $z^{n-1}$ in (7), we find that

$$\sum_{k=0}^{n} kW_kV_{n-k} = \alpha \sum_{k=0}^{n} (n-k)W_kV_{n-k}, \tag{8}$$

and this gives us a useful computational rule valid for all n ≥ 1:

$$W_n = \sum_{k=1}^{n} \Bigl(\Bigl(\frac{\alpha+1}{n}\Bigr)k - 1\Bigr)V_kW_{n-k} = \bigl((\alpha+1-n)V_1W_{n-1} + (2\alpha+2-n)V_2W_{n-2} + \cdots + n\alpha V_nW_0\bigr)/n. \tag{9}$$

Equation (9) leads to a simple online algorithm by which we can successively
determine $W_1$, $W_2$, ..., using approximately 2n multiplications to compute the
nth coefficient. Notice the special case α = −1, in which (9) becomes the special
case $U(z) = V_0 = 1$ of (3).
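
Rule (9) translates almost directly into code. The following Python sketch is only an illustration under stated assumptions: the name `series_power`, the coefficient-list interface, and the restriction to rational α (so that `fractions.Fraction` gives exact results) are choices made for this example, not part of the text.

```python
from fractions import Fraction

def series_power(V, alpha, N):
    """First N coefficients of V(z)**alpha by Euler's rule (9).

    V is a coefficient list with V[0] == 1; coefficients beyond the end
    of the list are treated as zero.  alpha must be an int or Fraction.
    """
    if V[0] != 1:
        raise ValueError("rule (9) assumes V_0 = 1; normalize with (4) first")
    alpha = Fraction(alpha)
    W = [Fraction(1)]                      # W_0 = 1
    for n in range(1, N):
        W.append(sum(((alpha + 1) * k / n - 1) * V[k] * W[n - k]
                     for k in range(1, n + 1) if k < len(V)))
    return W
```

With α = 1/2 and V(z) = 1 + z, the call `series_power([1, 1], Fraction(1, 2), 4)` yields the familiar expansion 1 + z/2 − z²/8 + z³/16 of the square root; with α = −1 the loop reduces to the division recurrence (3) with U(z) = 1.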

A similar technique can be used to form f(V(z)) when f is any function
that satisfies a simple differential equation. (For example, see exercise 4.) A
comparatively straightforward "power series method" is often used to obtain
the solution of differential equations; this technique is explained in nearly all
textbooks about differential equations.

[Flowchart: L1. Initialize → L2. Input $V_n$ → L3. Divide → L4. Output $W_n$ → back to L2; exit from L2 when n > N.]

Fig. 17. Power series reversion by Algorithm L.


Reversion of series. The transformation of power series that is perhaps of
greatest interest is called "reversion of series." This problem is to solve the
equation

$$z = t + V_2t^2 + V_3t^3 + V_4t^4 + \cdots \tag{10}$$

for t, obtaining the coefficients of the power series

$$t = z + W_2z^2 + W_3z^3 + W_4z^4 + \cdots. \tag{11}$$

Several interesting ways to achieve such a reversion are known. We might
say that the "classical" method is one based on Lagrange's remarkable inversion
formula [Mémoires Acad. Royale des Sciences et Belles-Lettres de Berlin 24
(1768), 251-326], which states that

$$W_n = \frac{1}{n}\,[t^{n-1}]\,(1 + V_2t + V_3t^2 + \cdots)^{-n}. \tag{12}$$

For example, we have $(1-t)^{-5} = \binom44 + \binom54 t + \binom64 t^2 + \cdots$; hence the fifth coefficient,
$W_5$, in the reversion of $z = t - t^2$ is equal to $\binom84/5 = 14$. This checks with the
formulas for enumerating binary trees in Section 2.3.4.4.
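
The same calculation works for every n in this example: since $(1-t)^{-n} = \sum_k \binom{k+n-1}{n-1}t^k$, formula (12) gives $W_n = \binom{2n-2}{n-1}/n$ for the reversion of $z = t - t^2$. The short Python check below is a sketch for this one special case only; the function name is hypothetical.

```python
from math import comb

def reversion_coeff_of_t_minus_t2(n):
    """W_n for the reversion of z = t - t^2, from Lagrange's formula (12).

    [t^(n-1)] (1 - t)^(-n) equals C(2n-2, n-1); dividing by n gives W_n.
    The division is exact because the quotient is a Catalan number.
    """
    return comb(2 * n - 2, n - 1) // n

first_five = [reversion_coeff_of_t_minus_t2(n) for n in range(1, 6)]
```

The list `first_five` holds $W_1$ through $W_5$, ending with the value 14 found above.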

Relation (12), which has a simple algorithmic proof (see exercise 16), shows
that we can revert the series (10) if we successively compute the negative powers
$(1 + V_2t + V_3t^2 + \cdots)^{-n}$ for n = 1, 2, 3, .... A straightforward application of
this idea would lead to an online reversion algorithm that uses approximately
$N^3/2$ multiplications to find N coefficients, but Eq. (9) makes it possible to work
with only the first n coefficients of $(1 + V_2t + V_3t^2 + \cdots)^{-n}$, obtaining an online
algorithm that requires only about $N^3/6$ multiplications.


Algorithm L (Lagrangian power series reversion). This online algorithm inputs
the value of $V_n$ in (10) and outputs the value of $W_n$ in (11), for n = 2, 3, 4,
..., N. (The number N need not be specified in advance; any desired termination
criterion may be substituted.)

L1. [Initialize.] Set n ← 1, $U_0$ ← 1. (The relation

$$(1 + V_2t + V_3t^2 + \cdots)^{-n} = U_0 + U_1t + \cdots + U_{n-1}t^{n-1} + O(t^n) \tag{13}$$

will be maintained throughout this algorithm.)

L2. [Input $V_n$.] Increase n by 1. If n > N, the algorithm terminates; otherwise
input the next coefficient, $V_n$.

L3. [Divide.] Set $U_k \leftarrow U_k - U_{k-1}V_2 - \cdots - U_1V_k - U_0V_{k+1}$, for k = 1, 2, ...,
n − 2 (in this order); then set

$$U_{n-1} \leftarrow -2U_{n-2}V_2 - 3U_{n-3}V_3 - \cdots - (n-1)U_1V_{n-1} - nU_0V_n.$$

(We have thereby divided U(z) by V(z)/z; see (3) and (9).)

L4. [Output $W_n$.] Output $U_{n-1}/n$ (which is $W_n$) and return to L2. ∎

When applied to the example $z = t - t^2$, Algorithm L computes

    n   V_n   U_0   U_1   U_2   U_3   U_4   W_n
    1    1     1                             1
    2   -1     1     2                       1
    3    0     1     3     6                 2
    4    0     1     4    10    20           5
    5    0     1     5    15    35    70    14

Exercise 8 shows that a slight modification of Algorithm L will solve a consider- 
ably more general problem with only a little more effort. 
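
Steps L1 through L4 can be sketched in Python as follows. This is an illustrative transcription, not a definitive implementation: the function name `revert_algorithm_L`, the dictionary interface for the inputs $V_n$ (missing entries are taken as zero), and the use of `fractions.Fraction` are assumptions of the example. For simplicity it takes N as an argument, although the method itself is online.

```python
from fractions import Fraction

def revert_algorithm_L(V, N):
    """Return [W_1, ..., W_N] for the reversion of z = t + V_2 t^2 + ...

    V maps n to V_n for n >= 2 (a dict; absent coefficients count as 0).
    """
    U = [Fraction(1)]                       # L1: U_0 = 1, n = 1
    W = [Fraction(1)]                       # W_1 = 1, see (11)
    for n in range(2, N + 1):               # L2: input V_n
        # L3: update U_1, ..., U_{n-2} in increasing order of k, so that
        # each update sees the already-updated U_{k-1}, ..., U_1.
        for k in range(1, n - 1):
            U[k] -= sum(U[k - j] * V.get(j + 1, 0) for j in range(1, k + 1))
        # New top coefficient U_{n-1} = -2 U_{n-2} V_2 - ... - n U_0 V_n.
        U.append(-sum((j + 1) * U[n - 1 - j] * V.get(j + 1, 0)
                      for j in range(1, n)))
        W.append(U[n - 1] / n)              # L4: output W_n
    return W
```

Calling `revert_algorithm_L({2: -1}, 5)` reproduces the $W_n$ column of the table for $z = t - t^2$.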


Let us now consider solving the equation

$$U_1z + U_2z^2 + U_3z^3 + \cdots = t + V_2t^2 + V_3t^3 + \cdots \tag{14}$$

for t, obtaining the coefficients of the power series

$$t = W_1z + W_2z^2 + W_3z^3 + W_4z^4 + \cdots. \tag{15}$$

Eq. (10) is the special case $U_1 = 1$, $U_2 = U_3 = \cdots = 0$. If $U_1 \ne 0$, we may
assume that $U_1 = 1$, if we replace z by $(U_1z)$; but we shall consider the general
equation (14), since $U_1$ might equal zero.

Algorithm T (General power series reversion). This online algorithm inputs
the values of $U_n$ and $V_n$ in (14) and outputs the value of $W_n$ in (15), for n = 1, 2,
3, ..., N. An auxiliary matrix $T_{mn}$, 1 ≤ m ≤ n ≤ N, is used in the calculations.

T1. [Initialize.] Set n ← 1. Let the first two inputs (namely, $U_1$ and $V_1$) be
stored in $T_{11}$ and $V_1$, respectively. (We must have $V_1 = 1$.)

T2. [Output $W_n$.] Output the value of $T_{1n}$ (which is $W_n$).

T3. [Input $U_n$, $V_n$.] Increase n by 1. If n > N, the algorithm terminates;
otherwise store the next two inputs (namely, $U_n$ and $V_n$) in $T_{1n}$ and $V_n$.

T4. [Multiply.] Set

$$T_{mn} \leftarrow T_{11}T_{m-1,n-1} + T_{12}T_{m-1,n-2} + \cdots + T_{1,n-m+1}T_{m-1,m-1}$$

and $T_{1n} \leftarrow T_{1n} - V_mT_{mn}$, for 2 ≤ m ≤ n. (After this step we have

$$t^m = T_{mm}z^m + T_{m,m+1}z^{m+1} + \cdots + T_{mn}z^n + O(z^{n+1}), \tag{16}$$

for 1 ≤ m ≤ n. It is easy to verify (16) by induction for m ≥ 2, and when
m = 1, we have $U_n = T_{1n} + V_2T_{2n} + \cdots + V_nT_{nn}$ by (14) and (16).) Return
to step T2. ∎
Equation (16) explains the mechanism of this algorithm, which is due to
Henry C. Thacher, Jr. [CACM 9 (1966), 10-11]. The running time is essentially 
the same as Algorithm L, but considerably more storage space is required. An 
example of this algorithm is worked out in exercise 9. 
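
A direct transcription of steps T1 through T4 might look like the Python sketch below. The name `revert_algorithm_T`, the dictionary inputs (absent coefficients count as zero), and the dictionary keyed by (m, n) that stands in for the auxiliary matrix are all assumptions made for illustration.

```python
from fractions import Fraction

def revert_algorithm_T(U, V, N):
    """Return [W_1, ..., W_N] solving U_1 z + U_2 z^2 + ... = t + V_2 t^2 + ...

    U and V map n to U_n and V_n; V_1 is taken to be 1.
    T[m, n] holds the coefficient of z^n in t^m, as in (16).
    """
    T = {(1, 1): Fraction(U.get(1, 0))}             # T1
    W = [T[1, 1]]                                   # T2: W_1 = T_11
    for n in range(2, N + 1):                       # T3: input U_n, V_n
        T[1, n] = Fraction(U.get(n, 0))
        for m in range(2, n + 1):                   # T4
            # T_mn = T_11 T_{m-1,n-1} + ... + T_{1,n-m+1} T_{m-1,m-1}
            T[m, n] = sum(T[1, j] * T[m - 1, n - j]
                          for j in range(1, n - m + 2))
            T[1, n] -= V.get(m, 0) * T[m, n]
        W.append(T[1, n])                           # T2: W_n = T_1n
    return W
```

The dictionary `T` grows to about N²/2 entries, which reflects the extra storage that this method needs compared with Algorithm L.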


Still another approach to power series reversion has been proposed by R. P.
Brent and H. T. Kung [JACM 25 (1978), 581-595], based on the fact that
standard iterative procedures used to find roots of equations over the real numbers
can also be applied to equations over power series. In particular, we can
consider Newton's method for computing approximations to a real number t
such that f(t) = 0, given a function f that is well-behaved near t: If x is a
good approximation to t, then $\phi(x) = x - f(x)/f'(x)$ will be even better, for if
we write x = t + ε we have $f(x) = f(t) + \epsilon f'(t) + O(\epsilon^2)$, $f'(x) = f'(t) + O(\epsilon)$;
consequently $\phi(x) = t + \epsilon - (0 + \epsilon f'(t) + O(\epsilon^2))/(f'(t) + O(\epsilon)) = t + O(\epsilon^2)$.
Applying this idea to power series, let f(x) = V(x) − U(z), where U and V are
the power series in Eq. (14). We wish to find the power series t in z such that
f(t) = 0. Let $x = W_1z + \cdots + W_{n-1}z^{n-1} = t + O(z^n)$ be an "approximation"
to t of order n; then $\phi(x) = x - f(x)/f'(x)$ will be an approximation of order 2n,
since the assumptions of Newton's method hold for this f and t.

In other words, we can use the following procedure: 


Algorithm N (General power series reversion by Newton's method). This "semi-online"
algorithm inputs the values of $U_n$ and $V_n$ in (14) for $2^k \le n < 2^{k+1}$ and
then outputs the values of $W_n$ in (15) for $2^k \le n < 2^{k+1}$, thereby producing its
answers in batches of $2^k$ at a time, for k = 0, 1, 2, ..., K.

N1. [Initialize.] Set N ← 1. (We will have $N = 2^k$.) Input the first coefficients
$U_1$ and $V_1$ (where $V_1 = 1$), and set $W_1 \leftarrow U_1$.

N2. [Output.] Output $W_n$ for N ≤ n < 2N.

N3. [Input.] Set N ← 2N. If $N > 2^K$, the algorithm terminates; otherwise
input the values $U_n$ and $V_n$ for N ≤ n < 2N.

N4. [Newtonian step.] Use an algorithm for power series composition (see exercise 11)
to evaluate the coefficients $Q_j$ and $R_j$ (0 ≤ j < N) in the power
series

$$\begin{aligned} U_1z + \cdots + U_{2N-1}z^{2N-1} - V(W_1z + \cdots + W_{N-1}z^{N-1}) &= R_0z^N + R_1z^{N+1} + \cdots + R_{N-1}z^{2N-1} + O(z^{2N}),\\ V'(W_1z + \cdots + W_{N-1}z^{N-1}) &= Q_0 + Q_1z + \cdots + Q_{N-1}z^{N-1} + O(z^N), \end{aligned}$$

where $V(x) = x + V_2x^2 + \cdots$ and $V'(x) = 1 + 2V_2x + \cdots$. Then set $W_N$, ...,
$W_{2N-1}$ to the coefficients in the power series

$$\frac{R_0 + R_1z + \cdots + R_{N-1}z^{N-1}}{Q_0 + Q_1z + \cdots + Q_{N-1}z^{N-1}} = W_N + \cdots + W_{2N-1}z^{N-1} + O(z^N)$$

and return to step N2. ∎
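
The Newton iteration behind Algorithm N can be tried out with a small Python sketch. It is not Algorithm N itself: for brevity it recomputes whole truncated series with naive multiplication, Horner-style composition, and the division recurrence (3), instead of working batch by batch with fast composition. All function names here are hypothetical, and series are coefficient lists indexed from the constant term.

```python
from fractions import Fraction

def mul(A, B, N):
    """A(z)B(z) mod z^N for coefficient lists A and B."""
    C = [Fraction(0)] * N
    for i, a in enumerate(A[:N]):
        if a:
            for j, b in enumerate(B[:N - i]):
                C[i + j] += a * b
    return C

def div(A, B, N):
    """A(z)/B(z) mod z^N by recurrence (3); needs B[0] != 0."""
    W = []
    for n in range(N):
        W.append((A[n] - sum(W[k] * B[n - k] for k in range(n))) / B[0])
    return W

def compose(P, X, N):
    """P(X(z)) mod z^N by Horner's rule, where X[0] == 0."""
    R = [Fraction(0)] * N
    for c in reversed(P):
        R = mul(R, X, N)
        R[0] += c
    return R

def revert_newton(U, V, N):
    """[0, W_1, ..., W_{N-1}] with V(t(z)) = U(z) mod z^N (N >= 2).

    U = [0, U_1, ...] and V = [0, 1, V_2, ...]; each pass through the
    loop applies x <- x - (V(x) - U)/V'(x) and doubles the order.
    """
    U = ([Fraction(c) for c in U] + [Fraction(0)] * N)[:N]
    V = [Fraction(c) for c in V][:N]
    dV = [k * V[k] for k in range(1, len(V))]        # coefficients of V'
    X = [Fraction(0), U[1]] + [Fraction(0)] * (N - 2)
    order = 2                                        # X is correct mod z^2
    while order < N:
        order = min(2 * order, N)
        F = [a - b for a, b in zip(compose(V, X, order), U)]
        step = div(F, compose(dV, X, order), order)
        X = [x - s for x, s in zip(X, step)] + X[order:]
    return X
```

For the running example $z = t - t^2$, `revert_newton([0, 1], [0, 1, -1], 6)` needs only two Newton steps to reach order 6.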
The running time for this algorithm to obtain the coefficients up to $N = 2^K$
is T(N), where

$$T(2N) = T(N) + (\text{time to do step N4}) + O(N). \tag{17}$$

Straightforward algorithms for composition and division in step N4 will take
order $N^3$ steps, so Algorithm N will run slower than Algorithm T. However,
Brent and Kung have found a way to do the required composition of power
series with $O(N \log N)^{3/2}$ arithmetic operations, and exercise 6 gives an even
faster algorithm for division; hence (17) shows that power series reversion can
be achieved by doing only $O(N \log N)^{3/2}$ operations as N → ∞. (On the other
hand the constant of proportionality is such that N must be really large before
Algorithms L and T lose out to this "high-speed" method.)

Historical note: J. N. Bramhall and M. A. Chapple published the first $O(N^3)$
method for power series reversion in CACM 4 (1961), 317-318, 503. It was an 
offline algorithm essentially equivalent to the method of exercise 16, with running 
time approximately the same as that of Algorithms L and T. 


Iteration of series. If we want to study the behavior of an iterative process
$x_n \leftarrow f(x_{n-1})$, we are interested in studying the n-fold composition of a given
function f with itself, namely $x_n = f(f(\ldots f(x_0)\ldots))$. Let us define $f^{[0]}(x) = x$
and $f^{[n]}(x) = f(f^{[n-1]}(x))$, so that

$$f^{[m+n]}(x) = f^{[m]}\bigl(f^{[n]}(x)\bigr) \tag{18}$$

for all integers m, n ≥ 0. In many cases the notation $f^{[n]}(x)$ makes sense
also when n is a negative integer, namely if $f^{[n]}$ and $f^{[-n]}$ are inverse functions
such that $x = f^{[n]}(f^{[-n]}(x))$; if inverse functions are unique, (18) holds for all
integers m and n. Reversion of series is essentially the operation of finding the
inverse power series $f^{[-1]}(x)$; for example, Eqs. (10) and (11) essentially state
that z = V(W(z)) and that t = W(V(t)), so $W = V^{[-1]}$.

Suppose we are given two power series $V(z) = z + V_2z^2 + \cdots$ and $W(z) =
z + W_2z^2 + \cdots$ such that $W = V^{[-1]}$. Let u be any nonzero constant, and consider
the function

$$U(z) = W(uV(z)). \tag{19}$$

It is easy to see that $U(U(z)) = W(u^2V(z))$, and in general that

$$U^{[n]}(z) = W(u^nV(z)) \tag{20}$$

for all integers n. Therefore we have a simple expression for the nth iterate
$U^{[n]}$, which can be calculated with roughly the same amount of work for all n.
Furthermore, we can even use (20) to define $U^{[n]}$ for noninteger values of n; the
"half iterate" $U^{[1/2]}$, for example, is a function such that $U^{[1/2]}(U^{[1/2]}(z)) =
U(z)$. (There are two such functions $U^{[1/2]}$, obtained by using $\sqrt{u}$ and $-\sqrt{u}$ as
the value of $u^{1/2}$ in (20).)
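
A small worked example may make (19) and (20) concrete. The choice $V(z) = z/(1-z)$ is an assumption made purely for illustration; its inverse is $W(z) = z/(1+z)$, and every step below follows from substituting into (19) and (20).

$$\begin{aligned}
U(z) &= W(uV(z)) = \frac{uz/(1-z)}{1 + uz/(1-z)} = \frac{uz}{1 + (u-1)z},\\
U^{[n]}(z) &= W(u^nV(z)) = \frac{u^nz}{1 + (u^n-1)z},\\
U^{[1/2]}(z) &= \frac{2z}{1+z} \quad\text{when } u = 4 \text{ and } u^{1/2} = 2,\\
U^{[1/2]}\bigl(U^{[1/2]}(z)\bigr) &= \frac{4z/(1+z)}{1 + 2z/(1+z)} = \frac{4z}{1+3z} = U(z).
\end{aligned}$$

Taking $u^{1/2} = -2$ instead gives the other half iterate, $-2z/(1-3z)$, which also composes with itself to $4z/(1+3z)$.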

We obtained the simple state of affairs in (20) by starting with V and u, then 
defining U. But in practice we generally want to go the other way: Starting with 
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some given function U, we want to find V and u such that (19) holds, namely 
such that 
V(U(z)) = uV(z). (21) 

Such a function V is called the Schréder function of U, because it was introduced 
by Ernst Schröder in Math. Annalen 3 (1871), 296-322. Let us now look at the 
problem of finding the Schröder function V(z) = z+ V22? +--+: of a given power 
series U(z) = Uiz + U22? +--+. Clearly u = U4 if (21) is to hold. 

Expanding (21) with $u = U_1$ and equating coefficients of z leads to a
sequence of equations that begins

$$U_1^2V_2 + U_2 = U_1V_2,$$
$$U_1^3V_3 + 2U_1U_2V_2 + U_3 = U_1V_3,$$
$$U_1^4V_4 + 3U_1^2U_2V_3 + 2U_1U_3V_2 + U_2^2V_2 + U_4 = U_1V_4,$$

and so on. Clearly there is no solution when $U_1 = 0$ (unless trivially $U_2 = U_3 =
\cdots = 0$); otherwise there is a unique solution unless $U_1$ is a root of unity. We
might have expected that something funny would happen when $U_1^n = 1$, since
Eq. (20) tells us that $U^{[n]}(z) = z$ if the Schröder function exists in that case.
For the moment let us assume that $U_1$ is nonzero and not a root of unity; then
the Schröder function does exist, and the next question is how to compute it
without doing too much work.
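
Before turning to the fast method, the coefficient equations above can be solved one at a time. The following Python sketch does exactly that with exact rational arithmetic; it is only an illustration of the straightforward approach (not the Brent-Traub procedure), and the names `mul` and `schroeder` are hypothetical helpers introduced here.

```python
from fractions import Fraction

def mul(a, b, n):
    """Product of two power series (coefficient lists), truncated to n terms."""
    c = [Fraction(0)] * n
    for i, ai in enumerate(a[:n]):
        if ai:
            for j, bj in enumerate(b[:n - i]):
                c[i + j] += ai * bj
    return c

def schroeder(U, n):
    """Return [V_0, ..., V_{n-1}] with V(U(z)) = U_1 V(z) + O(z^n), V_0 = 0, V_1 = 1.

    U is the coefficient list [0, U_1, U_2, ...]; this sketch assumes n >= 2 and
    that U_1 is nonzero and not a root of unity.
    """
    U = ([Fraction(x) for x in U] + [Fraction(0)] * n)[:n]
    u = U[1]
    V = [Fraction(0), Fraction(1)] + [Fraction(0)] * (n - 2)
    # powers[k] holds U(z)^k truncated to n terms
    powers = [[Fraction(1)] + [Fraction(0)] * (n - 1), U]
    for k in range(2, n):
        powers.append(mul(powers[-1], U, n))
    for m in range(2, n):
        # contribution to [z^m] V(U(z)) from the already known V_1, ..., V_{m-1}
        known = sum(V[k] * powers[k][m] for k in range(1, m))
        # the coefficient equation reads  u^m V_m + known = u V_m
        V[m] = known / (u - u ** m)
    return V
```

For instance, with $U(z) = 2z + z^2 = (1+z)^2 - 1$ the Schröder function is $\ln(1+z)$, because $\ln(1 + U(z)) = 2\ln(1+z)$; calling `schroeder([0, 2, 1], 6)` reproduces the coefficients 1, -1/2, 1/3, -1/4, 1/5.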

The following procedure has been suggested by R. P. Brent and J. F. Traub.
Equation (21) leads to subproblems of a similar but more complicated form, so
we set ourselves a more general task whose subtasks have the same form: Let us
try to find $V(z) = V_0 + V_1z + \cdots + V_{n-1}z^{n-1}$ such that

$$V(U(z)) = W(z)V(z) + S(z) + O(z^n), \tag{22}$$

given U(z), W(z), S(z), and n, where n is a power of 2 and U(0) = 0. If n = 1
we simply let $V_0 = S(0)/(1 - W(0))$, with $V_0 = 1$ if S(0) = 0 and W(0) = 1.
Furthermore it is possible to go from n to 2n: First we find R(z) such that

$$V(U(z)) = W(z)V(z) + S(z) - z^nR(z) + O(z^{2n}). \tag{23}$$

Then we compute

$$\hat W(z) = W(z)(z/U(z))^n + O(z^n), \qquad \hat S(z) = R(z)(z/U(z))^n + O(z^n), \tag{24}$$

and find $\hat V(z) = V_n + V_{n+1}z + \cdots + V_{2n-1}z^{n-1}$ such that

$$\hat V(U(z)) = \hat W(z)\hat V(z) + \hat S(z) + O(z^n). \tag{25}$$

It follows that the function $V^*(z) = V(z) + z^n\hat V(z)$ satisfies

$$V^*(U(z)) = W(z)V^*(z) + S(z) + O(z^{2n}),$$

as desired.
The running time T(n) of this procedure satisfies

$$T(2n) = 2T(n) + C(n), \tag{26}$$

where C(n) is the time to compute R(z), $\hat W(z)$, and $\hat S(z)$. The function C(n) is
dominated by the time to compute V(U(z)) modulo $z^{2n}$, and C(n) presumably
grows faster than order $n^{1+\epsilon}$; therefore the solution T(n) to (26) will be of
order C(n). For example, if $C(n) = cn^3$ we have $T(n) \approx \frac16cn^3$; or if C(n) is
$O(n\log n)^{3/2}$ using “fast” composition, we have $T(n) = O(n\log n)^{3/2}$.
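
The claim that T(n) has the same order as C(n) can be checked numerically by unrolling recurrence (26). The sketch below is a minimal illustration that assumes n is a power of 2 and T(1) = 0; the function name `total_time` is hypothetical.

```python
def total_time(n, cost, base=0.0):
    """T(n) from T(2n) = 2 T(n) + C(n), for n a power of 2, with T(1) = base."""
    t, size = base, 1
    while size < n:
        t = 2 * t + cost(size)   # one doubling step of recurrence (26)
        size *= 2
    return t

c = 1.0
# ratio T(n) / (c n^3) for n = 2^5, 2^10, 2^20 when C(n) = c n^3
ratios = [total_time(2 ** k, lambda n: c * n ** 3) / (c * (2 ** k) ** 3)
          for k in (5, 10, 20)]
```

Under these assumptions the ratios settle near 1/6, since each doubling multiplies the dominant term by 8 while the recursive part contributes only a factor of 2.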

The procedure breaks down when W(0) = 1 and S(0) ≠ 0, so we need to
investigate when this can happen. It is easy to prove by induction on n that
the solution of (22) by the Brent-Traub method entails consideration of exactly
n subproblems, in which the coefficient of V(z) on the right-hand side takes
the respective values $W(z)(z/U(z))^j + O(z^n)$ for 0 ≤ j < n in some order. If
$W(0) = U_1$ and if $U_1$ is not a root of unity, we therefore have W(0) = 1 only
when j = 1; the procedure will fail in this case only if (22) has no solution
for n = 2.

Consequently the Schröder function for U can be found by solving (22) for
n = 2, 4, 8, 16, ..., with $W(z) = U_1$ and S(z) = 0, whenever $U_1$ is nonzero and
not a root of unity.

If $U_1 = 1$, there is no Schröder function unless U(z) = z. But Brent and
Traub have found a fast way to compute $U^{[n]}(z)$ even when $U_1 = 1$, by making
use of a function V(z) such that

$$V(U(z)) = U'(z)V(z). \tag{27}$$

If two functions U(z) and $\hat U(z)$ both satisfy (27), for the same V, it is easy to
check that their composition $U(\hat U(z))$ does too; therefore all iterates of U(z) are
solutions of (27). Suppose we have $U(z) = z + U_kz^k + U_{k+1}z^{k+1} + \cdots$ where
k ≥ 2 and $U_k \ne 0$. Then it can be shown that there is a unique power series
of the form $V(z) = z^k + V_{k+1}z^{k+1} + V_{k+2}z^{k+2} + \cdots$ satisfying (27). Conversely
if such a function V(z) is given, and if k ≥ 2 and $U_k$ are given, then there is a
unique power series of the form $U(z) = z + U_kz^k + U_{k+1}z^{k+1} + \cdots$ satisfying (27).
The desired iterate $U^{[n]}(z)$ is the unique power series P(z) satisfying

$$V(P(z)) = P'(z)V(z) \tag{28}$$

such that $P(z) = z + nU_kz^k + \cdots$. Both V(z) and P(z) can be found by
appropriate algorithms (see exercise 14).

If $U_1$ is a kth root of unity, but not equal to 1, the same method can be
applied to the function $U^{[k]}(z) = z + \cdots$, and $U^{[k]}(z)$ can be found from U(z) by
doing l(k) composition operations (see Section 4.6.3). We can also handle the
case $U_1 = 0$: If $U(z) = U_kz^k + U_{k+1}z^{k+1} + \cdots$ where k ≥ 2 and $U_k \ne 0$, the idea
is to find a solution to the equation $V(U(z)) = U_kV(z)^k$; then

$$U^{[n]}(z) = V^{[-1]}\bigl(U_k^{(k^n-1)/(k-1)}\,V(z)^{k^n}\bigr). \tag{29}$$

Finally, if $U(z) = U_0 + U_1z + \cdots$ where $U_0 \ne 0$, let α be a “fixed point” such
that U(α) = α, and let

$$\hat U(z) = U(\alpha + z) - \alpha = zU'(\alpha) + z^2U''(\alpha)/2 + \cdots; \tag{30}$$

then $U^{[n]}(z) = \hat U^{[n]}(z - \alpha) + \alpha$. Further details can be found in Brent and Traub’s
paper [SICOMP 9 (1980), 54-66]. The V function of (27) had previously been
considered by M. Kuczma, Functional Equations in a Single Variable (Warsaw:
PWN-Polish Scientific, 1968), Lemma 9.4, and implicitly by E. Jabotinsky a few
years earlier (see exercise 23).


Algebraic functions. The coefficients of each power series W(z) that satisfies 
a general equation of the form 


$$A_n(z)W(z)^n + \cdots + A_1(z)W(z) + A_0(z) = 0, \tag{31}$$


where each $A_i(z)$ is a polynomial, can be computed efficiently by using methods
due to H. T. Kung and J. F. Traub; see JACM 25 (1978), 245-260. See also 
D. V. Chudnovsky and G. V. Chudnovsky, J. Complexity 2 (1986), 271-294; 
3 (1987), 1-25. 


EXERCISES 

1. [M10] The text explains how to divide U(z) by V(z) when $V_0 \ne 0$; how should
the division be done when $V_0 = 0$?

2. [20] If the coefficients of U(z) and V(z) are integers and $V_0 \ne 0$, find a recurrence
relation for the integers $V_0^{n+1}W_n$, where $W_n$ is defined by (3). How could you use this
for power series division?

3. [M15] Does formula (9) give the right results when α = 0? When α = 1?

4. [HM23] Show that simple modifications of (9) can be used to calculate $e^{V(z)}$ when
$V_0 = 0$, and $\ln V(z)$ when $V_0 = 1$.

5. [M00] What happens when a power series is reverted twice — that is, if the output 
of Algorithm L or T is reverted again? 

6. [M21] (H. T. Kung.) Apply Newton’s method to the computation of W(z) =
1/V(z), when V(0) ≠ 0, by finding the power series root of the equation f(x) = 0,
where $f(x) = x^{-1} - V(z)$.

7. [M23] Use Lagrange’s inversion formula (12) to find a simple expression for the
coefficient $W_n$ in the reversion of $z = t - t^m$.

8. [M25] If $W(z) = W_1z + W_2z^2 + W_3z^3 + \cdots = G_1t + G_2t^2 + G_3t^3 + \cdots = G(t)$,
where $z = V_1t + V_2t^2 + V_3t^3 + \cdots$ and $V_1 \ne 0$, Lagrange proved that

$$W_n = \frac{1}{n}\,[t^{n-1}]\;G'(t)/(V_1 + V_2t + V_3t^2 + \cdots)^n.$$

(Equation (12) is the special case $G_1 = V_1 = 1$, $G_2 = G_3 = \cdots = 0$.) Extend
Algorithm L so that it obtains the coefficients $W_1$, $W_2$, ... in this more general situation,
without substantially increasing its running time.

9. [11] Find the values of $T_{mn}$ computed by Algorithm T as it determines the first
five coefficients in the reversion of $z = t - t^2$.

10. [M20] Given that $y = x^\alpha + a_1x^{\alpha+1} + a_2x^{\alpha+2} + \cdots$, α ≠ 0, show how to compute
the coefficients in the expansion $x = y^{1/\alpha} + b_2y^{2/\alpha} + b_3y^{3/\alpha} + \cdots$.


11. [M25] (Composition of power series.) Let
$U(z) = U_0 + U_1z + U_2z^2 + \cdots$ and $V(z) = V_1z + V_2z^2 + V_3z^3 + \cdots$.
Design an algorithm that computes the first N coefficients of U(V(z)).

12. [M20] Find a connection between polynomial division and power series division:
Given polynomials u(x) and v(x) of respective degrees m and n over a field, show how
to find the polynomials q(x) and r(x) such that u(x) = q(x)v(x) + r(x) and deg(r) < n,
using only operations on power series.

13. [M27] (Rational function approximation.) It is occasionally desirable to find
polynomials whose quotient has the same initial terms as a given power series. For
example, if $W(z) = 1 + z + 3z^2 + 7z^3 + \cdots$, there are essentially four different ways
to express W(z) as $w_1(z)/w_2(z) + O(z^4)$ where $w_1(z)$ and $w_2(z)$ are polynomials with
$\deg(w_1) + \deg(w_2) < 4$:

$$(1 + z + 3z^2 + 7z^3)/1 = 1 + z + 3z^2 + 7z^3 + 0z^4 + \cdots,$$
$$(3 - 4z + 2z^2)/(3 - 7z) = 1 + z + 3z^2 + 7z^3 + \tfrac{49}{3}z^4 + \cdots,$$
$$(1 - z)/(1 - 2z - z^2) = 1 + z + 3z^2 + 7z^3 + 17z^4 + \cdots,$$
$$1/(1 - z - 2z^2 - 2z^3) = 1 + z + 3z^2 + 7z^3 + 15z^4 + \cdots.$$

Rational functions of this kind are commonly called Padé approximations, since they
were studied extensively by H. E. Padé [Annales Scient. de l’École Normale Supérieure
(3) 9 (1892), S1-S93; (3) 16 (1899), 395-426].

Show that all Padé approximations $W(z) = w_1(z)/w_2(z) + O(z^N)$ with $\deg(w_1) +
\deg(w_2) < N$ can be obtained by applying an extended Euclidean algorithm to the
polynomials $z^N$ and $W_0 + W_1z + \cdots + W_{N-1}z^{N-1}$; and design an all-integer algorithm
for the case that each $W_i$ is an integer. [Hint: See exercise 4.6.1-26.]

14. [HM30] Fill in the details of Brent and Traub’s method for calculating $U^{[n]}(z)$
when $U(z) = z + U_kz^k + \cdots$, using (27) and (28).

15. [HM20] For what functions U(z) does V(z) have the simple form $z^k$ in (27)?
What do you deduce about the iterates of U(z)?


16. [HM21] Let W(z) = G(t) as in exercise 8. The “obvious” way to find the
coefficients $W_1$, $W_2$, $W_3$, ... is to proceed as follows: Set n ← 1 and $R_1(t) \leftarrow G(t)$.
Then preserve the relation $W_nV(t) + W_{n+1}V(t)^2 + \cdots = R_n(t)$ by repeatedly setting
$W_n \leftarrow [t]\,R_n(t)/V_1$, $R_{n+1}(t) \leftarrow R_n(t)/V(t) - W_n$, n ← n + 1.

Prove Lagrange’s formula of exercise 8 by showing that 


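$$\frac{1}{n}\,[t^{n-1}]\;R'_{k+1}(t)\,t^n/V(t)^n = \frac{1}{n+1}\,[t^n]\;R'_k(t)\,t^{n+1}/V(t)^{n+1}, \qquad\text{for all } n \ge 1 \text{ and } k \ge 1.$$
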
17. [M20] Given the power series $V(z) = V_1z + V_2z^2 + V_3z^3 + \cdots$, we define the power
matrix of V as the infinite array of coefficients $v_{nk} = \frac{n!}{k!}[z^n]V(z)^k$; the nth poweroid
of V is then defined to be $V_n(x) = v_{n0} + v_{n1}x + \cdots + v_{nn}x^n$. Prove that poweroids
satisfy the general convolution law

$$V_n(x + y) = \sum_k \binom{n}{k} V_k(x)V_{n-k}(y).$$

(For example, when V(z) = z we have $V_n(x) = x^n$, and this is the binomial theorem.
When $V(z) = \ln(1/(1 - z))$ we have $v_{nk} = {n \brack k}$ by Eq. 1.2.9-(26); hence the poweroid
$V_n(x)$ is $x^{\overline{n}}$, and the identity is the result proved in exercise 1.2.6-33. When $V(z) =
e^z - 1$ we have $V_n(x) = \sum_k {n \brace k}x^k$ and the formula is equivalent to


$$\binom{l+m}{m}{n \brace l+m} = \sum_k \binom{n}{k}{k \brace l}{n-k \brace m},$$

an identity we haven’t seen before. Several other triangular arrays of coefficients that
arise in combinatorial mathematics and the analysis of algorithms also turn out to be
the power matrices of power series.)


18. [HM22] Continuing exercise 17, prove that poweroids also satisfy

$$xV_n(x + y) = (x + y)\sum_k \binom{n-1}{k-1} V_k(x)V_{n-k}(y).$$

[Hint: Consider the derivative of $e^{xV(z)}$.]

19. [M25] Continuing exercise 17, express all the numbers $v_{nk}$ in terms of the numbers
$v_n = v_{n1} = n!\,V_n$ of the first column, and find a simple recurrence by which all columns
can be computed from the sequence $v_1$, $v_2$, .... Show in particular that if all the $v_n$
are integers, then all the $v_{nk}$ are integers.
20. [HM20] Continuing exercise 17, suppose we have W(z) = U(V(z)) and $U_0 = 0$.
Prove that the power matrix of W is the product of the power matrices of V and U:
$w_{nk} = \sum_j v_{nj}u_{jk}$.

> 21. [HM27] Continuing the previous exercises, suppose $V_1 \ne 0$ and let $W(z) =
-V^{[-1]}(-z)$. The purpose of this exercise is to show that the power matrices of V
and W are “dual” to each other; for example, when $V(z) = \ln(1/(1 - z))$ we have
$V^{[-1]}(z) = 1 - e^{-z}$, $W(z) = e^z - 1$, and the corresponding power matrices are the
well-known Stirling triangles $v_{nk} = {n \brack k}$, $w_{nk} = {n \brace k}$.

a) Prove that the inversion formulas 1.2.6-(47) for Stirling numbers hold in general:

$$\sum_k v_{nk}w_{km}(-1)^{n-k} = \sum_k w_{nk}v_{km}(-1)^{n-k} = \delta_{mn}.$$

b) The relation $v_{n(n-k)} = n^{\underline{k}}\,[z^k]\,(V(z)/z)^{n-k}$ shows that, for fixed k, the quantity
$v_{n(n-k)}/V_1^n$ is a polynomial in n of degree ≤ 2k. We can therefore define

$$v_{\alpha(\alpha-k)} = \alpha^{\underline{k}}\,[z^k]\,(V(z)/z)^{\alpha-k}$$

for arbitrary α when k is a nonnegative integer, as we did for Stirling numbers in
Section 1.2.6. Prove that $v_{(-k)(-n)} = w_{nk}$. (This generalizes Eq. 1.2.6-(58).)

> 22. [HM27] Given $U(z) = U_0 + U_1z + U_2z^2 + \cdots$ with $U_0 \ne 0$, the αth induced function
$U^{\{\alpha\}}(z)$ is the power series V(z) defined implicitly by the equation

$$V(z) = U(zV(z)^\alpha).$$

a) Prove that $U^{\{0\}}(z) = U(z)$ and $U^{\{\alpha\}\{\beta\}}(z) = U^{\{\alpha+\beta\}}(z)$.
b) Let B(z) be the simple binomial series 1 + z. Where have we seen $B^{\{2\}}(z)$ before?
c) Prove that $[z^n]\,U^{\{\alpha\}}(z)^x = \frac{x}{x+n\alpha}\,[z^n]\,U(z)^{x+n\alpha}$. Hint: If $W(z) = z/U(z)^\alpha$, we
have $U^{\{\alpha\}}(z) = (W^{[-1]}(z)/z)^{1/\alpha}$.
d) Consequently any poweroid $V_n(x)$ satisfies not only the identities of exercises 17
and 18, but also

$$\frac{(x+y)V_n(x+y+n\alpha)}{x+y+n\alpha} = \sum_k \binom{n}{k}\,\frac{xV_k(x+k\alpha)}{x+k\alpha}\;\frac{yV_{n-k}(y+(n-k)\alpha)}{y+(n-k)\alpha};$$


Vala +y) _ 5 o Vn—k(y — ka) 
y—na =w k-1/ r+ka y—-ka ` 


k 


[Special cases include Abel’s binomial theorem, Eq. 1.2.6-(16); Rothe’s identities
1.2.6-(26) and 1.2.6-(30); Torelli’s sum, exercise 1.2.6-34.]

23. [HM35] (E. Jabotinsky.) Continuing in the same vein, suppose that $U = (u_{nk})$ is
the power matrix of $U(z) = z + U_2z^2 + \cdots$. Let $u_n = u_{n1} = n!\,U_n$.

a) Explain how to compute a matrix ln U so that the power matrix of $U^{[\alpha]}(z)$ is
$\exp(\alpha\ln U) = I + \alpha\ln U + (\alpha\ln U)^2/2! + \cdots$.
b) Let $l_{nk}$ be the entry in row n and column k of ln U, and let

$$l_n = l_{n1}, \qquad L(z) = l_2\frac{z^2}{2!} + l_3\frac{z^3}{3!} + l_4\frac{z^4}{4!} + \cdots.$$

Prove that $l_{nk} = \binom{n}{k-1}l_{n+1-k}$ for 1 ≤ k ≤ n. [Hint: $U^{[\epsilon]}(z) = z + \epsilon L(z) + O(\epsilon^2)$.]
c) Considering $U^{[\alpha]}(z)$ as a function of both α and z, prove that

$$\frac{\partial}{\partial\alpha}U^{[\alpha]}(z) = L(z)\,\frac{\partial}{\partial z}U^{[\alpha]}(z) = L(U^{[\alpha]}(z)).$$

(Consequently $L(z) = (l_k/k!)V(z)$, where V(z) is the function in (27) and (28).)
d) Show that if $u_2 \ne 0$, the numbers $l_n$ can be computed from the recurrence

$$l_2 = u_2, \qquad \sum_{k=2}^{n}\binom{n}{k}l_ku_{n+1-k} = \sum_{k=2}^{n}l_ku_{nk}.$$

How would you use this recurrence when $u_2 = 0$?
e) Prove the identity

$$u_n = \sum_{m=0}^{n-1}\frac{n!}{m!}\sum_{\substack{k_1+\cdots+k_m=n+m-1\\ k_1,\ldots,k_m\ge2}}\frac{n_0}{k_1!}\,\frac{n_1}{k_2!}\cdots\frac{n_{m-1}}{k_m!}\;l_{k_1}l_{k_2}\ldots l_{k_m},$$

where $n_j = 1 + k_1 + \cdots + k_j - j$.


24. [HM25] Given the power series $U(z) = U_1z + U_2z^2 + \cdots$, where $U_1$ is not a root
of unity, let $U = (u_{nk})$ be the power matrix of U(z).
a) Explain how to compute a matrix ln U so that the power matrix of $U^{[\alpha]}(z)$ is
$\exp(\alpha\ln U) = I + \alpha\ln U + (\alpha\ln U)^2/2! + \cdots$.
b) Show that if W(z) is not identically zero and if U(W(z)) = W(U(z)), then W(z) =
$U^{[\alpha]}(z)$ for some complex number α.

25. [M24] If $U(z) = z + U_kz^k + U_{k+1}z^{k+1} + \cdots$ and $V(z) = z + V_lz^l + V_{l+1}z^{l+1} + \cdots$,
where k ≥ 2, l ≥ 2, $U_k \ne 0$, $V_l \ne 0$, and U(V(z)) = V(U(z)), prove that we must have
k = l and $V(z) = U^{[\alpha]}(z)$ for $\alpha = V_k/U_k$.

26. [M22] Show that if $U(z) = U_0 + U_1z + U_2z^2 + \cdots$ and $V(z) = V_1z + V_2z^2 + \cdots$
are power series with all coefficients 0 or 1, we can obtain the first N coefficients of
U(V(z)) mod 2 in $O(N^{1+\epsilon})$ steps, for any ε > 0.
27. [M22] (D. Zeilberger.) Find a recurrence analogous to (9) for computing the
coefficients of $W(z) = V(z)V(qz)\ldots V(q^{m-1}z)$, given q, m, and the coefficients of
$V(z) = 1 + V_1z + V_2z^2 + \cdots$. Assume that q is not a root of unity.

> 28. [HM26] A Dirichlet series is a sum of the form $V(z) = V_1/1^z + V_2/2^z + V_3/3^z + \cdots$;
the product U(z)V(z) of two such series is the Dirichlet series W(z) where

$$W_n = \sum_{d\backslash n} U_dV_{n/d}.$$

Ordinary power series are special cases of Dirichlet series, since we have $V_0 + V_1z +
V_2z^2 + V_3z^3 + \cdots = V_0/1^s + V_1/2^s + V_2/4^s + V_3/8^s + \cdots$ when $z = 2^{-s}$. In fact,
Dirichlet series are essentially equivalent to power series $V(z_1, z_2, \ldots)$ in arbitrarily
many variables, where $z_k = p_k^{-s}$ and $p_k$ is the kth prime number.

Find recurrence relations that generalize (9) and the formulas of exercise 4, assuming
that a Dirichlet series V(z) is given and that we want to calculate (a) $W(z) = V(z)^\alpha$
when $V_1 = 1$; (b) $W(z) = \exp V(z)$ when $V_1 = 0$; (c) $W(z) = \ln V(z)$ when $V_1 = 1$.
[Hint: Let t(n) be the total number of prime factors of n, including multiplicity, and let
$\delta\sum_n V_n/n^z = \sum_n t(n)V_n/n^z$. Show that δ is analogous to a derivative; for example,
$\delta e^{V(z)} = e^{V(z)}\,\delta V(z)$.]


It seems impossible that any thing 
should really alter the series of things, 
without the same power which first produced them. 


— EDWARD STILLINGFLEET, Origines Sacræ, 2:3:2 (1662) 


This business of series, the most disagreeable thing in mathematics, 
is no more than a game for the English; 
Stirling's book, and the one by de Moivre, are proof. 


— PIERRE DE MAUPERTUIS, letter to d’Ortous de Mairan (30 Oct 1730) 


He was daunted and bewildered by their almost infinite series. 
— G. K. CHESTERTON, The Man Who Was Thursday (1907) 


ANSWERS TO EXERCISES 


This branch of mathematics is the only one, I believe, 

in which good writers frequently get results entirely erroneous. 
It may be doubted if there is a single 

extensive treatise on probabilities in existence 

which does not contain solutions absolutely indefensible. 


— C. S. PEIRCE, in Popular Science Monthly (1878) 


NOTES ON THE EXERCISES 
1. An average problem for a mathematically inclined reader. 


3. (Solution by Roger Frye, after about 110 hours of computation on a Connection
Machine in 1987.) $95800^4 + 217519^4 + 414560^4 = 422481^4$ and (therefore) $191600^4 +
435038^4 + 829120^4 = 844962^4$.


4. (One of the readers of the preliminary manuscript for this book reported that he 
had found a truly remarkable proof. But unfortunately the margin of his copy was too 
small to contain it.) 


SECTION 3.1 


1. (a) This will usually fail, since “round” telephone numbers are often selected by 
the telephone user when possible. In some communities, telephone numbers are perhaps 
assigned randomly. But it would be a mistake in any case to try to get several successive 
random numbers from the same page, since the same telephone number is often listed 
several times in a row. 

(b) But do you use the left-hand page or the right-hand page? Say, use the left- 
hand page number, divide by 2, and take the units digit. The total number of pages 
should be a multiple of 20; but even so, this method will have some bias. 

(c) The markings on the faces will slightly bias the die, but for practical purposes 
this method is quite satisfactory (and it has been used by the author in the preparation 
of several examples in this set of books). See Math. Comp. 15 (1961), 94-95, for further 
discussion of icosahedral dice. 

(d) (This is a hard question thrown in purposely as a surprise.) The number is
not quite uniformly random. If the average number of emissions per minute is m, the
probability that the counter registers k is $e^{-m}m^k/k!$ (the Poisson distribution); so the
digit 0 is selected with probability $e^{-m}\sum_{k\ge0}m^{10k}/(10k)!$, etc. In particular, the units
digit will be even with probability $e^{-m}\cosh m = \frac12 + \frac12e^{-2m}$, and this is never equal
to $\frac12$ (although the error is negligibly small when m is large).

It is almost legitimate to take ten readings $(m_0, \ldots, m_9)$ and then to output j if
$m_j$ is strictly less than $m_i$ for all i ≠ j; try again if the minimum value appears more
than once. (See (h).) However, the parameter m isn’t really constant in the real world.

(e) Okay, provided that the time since the previous digit selected in this way is 
random. However, there is possible bias in borderline cases. 

(f,g) No. People usually think of certain digits (like 7) with higher probability. 

(h) Okay; your assignment of numbers to the horses had probability $\frac{1}{10}$ of assigning
a given digit to the winning horse (unless you know, say, the jockey).


2. The number of such sequences is the multinomial coefficient $1000000!/(100000!)^{10}$;
the probability is this number divided by $10^{1000000}$, the total number of sequences of
a million digits. By Stirling’s approximation we find that the probability is close to
$1/(16\pi^4\,10^{22}\sqrt{2\pi}) \approx 2.56 \times 10^{-26}$, roughly one chance in $4 \times 10^{25}$.
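
The size of this probability can be cross-checked without Stirling's approximation by working with logarithms of factorials. The following Python sketch assumes only the standard library; the function name is hypothetical.

```python
import math

def log10_prob_equal_digit_counts(n_digits=1_000_000, base=10):
    """log10 of the probability that each digit occurs exactly n_digits/base times."""
    per_digit = n_digits // base
    # natural log of the multinomial coefficient n! / ((n/base)!)^base
    log_multinomial = math.lgamma(n_digits + 1) - base * math.lgamma(per_digit + 1)
    # divide by base^n_digits, the total number of digit sequences
    return log_multinomial / math.log(10) - n_digits * math.log10(base)
```

With the default arguments the result should come out near -25.59, which corresponds to the value of roughly 2.56 × 10^-26 quoted above.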


3. 3040504030. 


4. (a) Step K11 can be entered only from step K10 or step K2, and in either case we 
find it impossible for X to be zero by a simple argument. If X could be zero at that 
point, the algorithm would not terminate. 

(b) If X is initially 3830951656, the computation is like many of the steps that
appear in Table 1 except that we reach step K11 with Y = 3 instead of Y = 5; hence
3830951656 → 5870802097. Similarly, 5870802097 → 1226919902 → 3172562687 →
3319967479 → 6065038420 → 6065038420 → ···.


5. Since only $10^{10}$ ten-digit numbers are possible, some value of X must be repeated
during the first $10^{10} + 1$ steps; and as soon as a value is repeated, the sequence continues
to repeat its past behavior.


6. (a) Arguing as in the previous exercise, the sequence must eventually repeat a
value; let this repetition occur for the first time at step μ + λ, where $X_{\mu+\lambda} = X_\mu$.
(This condition defines μ and λ.) We have 0 ≤ μ < m, 0 < λ ≤ m, μ + λ ≤ m. The
values μ = 0, λ = m are attained if and only if f is a cyclic permutation; and μ = m − 1,
λ = 1 occurs, e.g., if $X_0 = 0$, f(x) = x + 1 for x < m − 1, and f(m − 1) = m − 1.

(b) We have, for r > n, $X_r = X_n$ if and only if r − n is a multiple of λ and n ≥ μ.
Hence $X_{2n} = X_n$ if and only if n is a multiple of λ and n ≥ μ. The desired results now
follow immediately. [Note: Equivalently, the powers of an element in a finite semigroup
include a unique idempotent element: Take $X_1 = a$, f(x) = ax. See G. Frobenius,
Sitzungsberichte preußische Akademie der Wissenschaften (1895), 82-83.]

(c) Once n has been found, generate $X_i$ and $X_{n+i}$ for i ≥ 0 until first finding
$X_i = X_{n+i}$; then μ = i. If none of the values of $X_{n+i}$ for 0 < i < μ is equal to $X_n$, it
follows that λ = n, otherwise λ is the smallest such i.
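
Parts (b) and (c) together describe a complete procedure, which can be sketched in Python as follows. This is an illustrative transcription under the assumption that f is an ordinary Python function on hashable values; the name `floyd` is hypothetical.

```python
def floyd(f, x0):
    """Return (mu, lam) for the sequence x0, f(x0), f(f(x0)), ..."""
    # Phase 1 (part b): find the least n > 0 with X_n == X_{2n}.
    n, slow, fast = 1, f(x0), f(f(x0))
    while slow != fast:
        slow, fast, n = f(slow), f(f(fast)), n + 1
    x_n = slow
    # Phase 2 (part c): advance X_i and X_{n+i} together; mu is the first i
    # at which they agree.  Meanwhile remember the first i > 0 for which
    # X_{n+i} == X_n; if there is none, the period length is n itself.
    mu, a, b, lam = 0, x0, x_n, None
    while a != b:
        a, b, mu = f(a), f(b), mu + 1
        if lam is None and b == x_n:
            lam = mu
    return mu, (lam if lam is not None else n)
```

A call such as `floyd(lambda x: (x * x + 1) % 255, 3)` returns the pair (μ, λ) for that particular sequence; only two values of the sequence are held in memory at any time.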

7. (a) The least n > 0 such that n − (ℓ(n) − 1) is a multiple of λ and ℓ(n) − 1 ≥ μ
is $n = 2^{\lceil\lg\max(\mu+1,\lambda)\rceil} - 1 + \lambda$. [This may be compared with the least n > 0 such that
$X_{2n} = X_n$, namely $\lambda(\lceil\mu/\lambda\rceil + \delta_{\mu0})$.]

(b) Start with $X = Y = X_0$, k = m = 1. (At key places in this algorithm we will
have $X = X_{2m-k-1}$, $Y = X_{m-1}$, and m = ℓ(2m − k).) To generate the next random
number, do the following steps: Set X ← f(X) and k ← k − 1. If X = Y, stop (the
period length λ is equal to m − k). Otherwise if k = 0, set Y ← X, m ← 2m, k ← m.
Output X.
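
The steps of part (b) translate almost line for line into a Python generator. The sketch below is one possible rendering, with hypothetical names `brent_outputs` and `brent_period`; the generator yields the outputs and reports the period length as its return value.

```python
def brent_outputs(f, x0):
    """Yield X_1, X_2, ... until a repeat is detected; the return value is lambda."""
    x = y = x0
    k = m = 1
    while True:
        x = f(x)           # X <- f(X)
        k -= 1             # k <- k - 1
        if x == y:
            return m - k   # stop: the period length is m - k
        if k == 0:
            y, m = x, 2 * m
            k = m
        yield x            # output X

def brent_period(f, x0):
    """Run the generator to completion; return (lambda, list of outputs)."""
    gen, outputs = brent_outputs(f, x0), []
    while True:
        try:
            outputs.append(next(gen))
        except StopIteration as stop:
            return stop.value, outputs
```

Only the current value X and the saved value Y are kept, so the memory requirement is constant.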

Notes: Brent has also considered a more general method in which the successive
values of $Y = X_{n_i}$ satisfy $n_1 = 0$, $n_{i+1} = 1 + \lfloor pn_i\rfloor$ where p is any number greater
than 1. He showed that the best choice of p, approximately 2.4771, saves about 3
percent of the iterations by comparison with p = 2. (See exercise 4.5.4-4.)

The method in part (b) has a serious deficiency, however, since it might generate
a lot of nonrandom numbers before shutting off. For example, we might have a
particularly bad case such as λ = 1, $\mu = 2^k$. A method based on Floyd’s idea in
exercise 6(b), namely one that maintains $Y = X_{2n}$ and $X = X_n$ for n = 0, 1, 2,
..., will require a few more function evaluations than Brent’s method, but it will stop
before any number has been output twice.

On the other hand, if f is unknown (for example, if we are receiving the values $X_0$,
$X_1$, ... online from an outside source) or if f is difficult to apply, the following cycle
detection algorithm due to R. W. Gosper will be preferable: Maintain an auxiliary
table $T_0$, $T_1$, ..., $T_m$, where $m = \lfloor\lg n\rfloor$ when receiving $X_n$. Initially $T_0 \leftarrow X_0$. For
n = 1, 2, ..., compare $X_n$ with each of $T_0$, ..., $T_{\lfloor\lg n\rfloor}$; if no match is found, set
$T_{e(n)} \leftarrow X_n$, where $e(n) = \rho(n+1) = \max\{e \mid 2^e \text{ divides } n+1\}$. But if a match
$X_n = T_k$ is found, then $\lambda = n - \max\{\,l \mid l < n \text{ and } e(l) = k\,\}$. After $X_n$ has been stored
in $T_{e(n)}$, it is subsequently compared with $X_{n+1}$, $X_{n+2}$, ..., $X_{n+2^{e(n)+1}}$. Therefore the
procedure stops immediately after generating $X_{\mu+\lambda+j}$, where j ≥ 0 is minimum with
$e(\mu + j) \ge \lceil\lg\lambda\rceil - 1$. With this method, no X value is generated more than twice, and
at most $\max(1, 2^{\lceil\lg\lambda\rceil-1})$ values are generated more than once. [MIT AI Laboratory
Memo 239 (29 February 1972), Hack 132.]
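
Gosper's table-based method can be sketched in Python as follows. The code is an illustrative reading of the description above, not a transcription of the original memo; the names `gosper_period` and `orbit` are hypothetical, and each table slot also stores the index at which its value was received so that λ can be computed directly.

```python
def orbit(f, x0):
    """Generate X_0, X_1, X_2, ... (stands in for an online source of values)."""
    x = x0
    while True:
        yield x
        x = f(x)

def gosper_period(xs):
    """Consume X_0, X_1, ... from an iterable and return the period length."""
    table = []   # table[k] = (value, index) of the latest X_l with e(l) == k
    for n, x in enumerate(xs):
        if n > 0:
            for k in range(n.bit_length()):        # k = 0, ..., floor(lg n)
                value, index = table[k]
                if value == x:
                    return n - index               # lambda = n - l
        e = ((n + 1) & -(n + 1)).bit_length() - 1  # ruler function rho(n + 1)
        if e == len(table):
            table.append((x, n))
        else:
            table[e] = (x, n)
    return None  # the input ended before any repetition was seen
```

The table grows by one slot each time n + 1 reaches a new power of 2, so about lg n values are stored after n inputs.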

R. Sedgewick, T. G. Szymanski, and A. C. Yao have analyzed a more complex 
algorithm based on parameters m > 2 and g > 1: An auxiliary table of size m contains 
Xo, Xv, ..., Xqp at the moment that Xn is computed, where b = ghenr/m) and q = 
[n/b] — 1. If nmod gb < b, Xn is compared to the entries in the table; eventually 
equality occurs, and we can reconstruct u and À after doing at most (g + 1)2!is+A)|+1 
further evaluations of f. If the evaluation of f costs τ units of time, and if testing
$X_n$ for membership in the table costs σ units, then g can be chosen so that the total
running time is $(\mu + \lambda)(\tau + O(\sigma\tau/m)^{1/2})$; this is optimum if σ/τ = O(m). Moreover, $X_n$
is not computed unless μ + λ > mn/(m + 4g + 2), so we can use this method “online” to
output elements that are guaranteed to be distinct, making only $2 + O(m^{-1/2})$ function
evaluations per output. [SICOMP 11 (1982), 376-390.]


8. (a,b) 00,00,... [62 starting values]; 10,10,... [19]; 60,60,... [15]; 50,50,... [1]; 
24,57, 24,57,... [3]. (c) 42 or 69; these both lead to a set of fifteen distinct values, 
namely (42 or 69), 76, 77, 92, 46, 11, 12, 14, 19, 36, 29, 84, 05, 02, 00. 


9. Since $X < b^n$, we have $X^2 < b^{2n}$, and the middle square is $\lfloor X^2/b^n\rfloor \le X^2/b^n$. If
X > 0, then $X^2/b^n < Xb^n/b^n = X$.


10. If $X = ab^n$, the next number of the sequence has the same form; it is equal to
$(a^2 \bmod b^n)b^n$. If a is a multiple of all the prime factors of b, the sequence will soon
degenerate to zero; if not, the sequence will degenerate into a cycle of numbers having
the same general form as X.

Further facts about the middle-square method have been found by B. Jansson,
Random Number Generators (Stockholm: Almqvist & Wiksell, 1966), Section 3A.
Numerologists will be interested to learn that the number 3792 is self-reproducing in
the four-digit middle-square method, since $3792^2 = 14379264$; furthermore (as Jansson
observed), it is “self-reproducing” in another sense, too, since its prime factorization is
$3\cdot79\cdot2^4$!
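
These observations about the middle-square method are easy to experiment with. The short Python sketch below assumes an even number of digits; the function name `middle_square` is hypothetical.

```python
def middle_square(x, digits=4, base=10):
    """One step of the middle-square method on numbers with `digits` digits (digits even)."""
    half = digits // 2
    # square, drop the low-order `half` digits, keep the next `digits` digits
    return (x * x // base ** half) % base ** digits

fixed_point = middle_square(3792)            # the self-reproducing four-digit value
two_cycle = (middle_square(24, digits=2),    # the two-digit cycle 24, 57, 24, 57, ...
             middle_square(57, digits=2))
```

Starting the two-digit version from 42 and iterating gives the chain 76, 77, 92, 46, ... that eventually reaches 00.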


11. The probability that μ = 0 and λ = 1 is the probability that $X_1 = X_0$, namely
1/m. The probability that (μ, λ) = (1, 1) or that (μ, λ) = (0, 2) is the probability that
$X_1 \ne X_0$ and that $X_2$ has a certain value, so it is $(1 - 1/m)(1/m)$. Similarly, the
probability that the sequence has any given μ and λ is a function only of μ + λ, namely

$$P(\mu, \lambda) = \frac{1}{m}\prod_{1\le k<\mu+\lambda}\Bigl(1 - \frac{k}{m}\Bigr).$$

Summing over all μ ≥ 0 with λ = 1 gives the probability that λ = 1, namely

$$\sum_{\mu\ge0}\frac{1}{m}\prod_{1\le k\le\mu}\Bigl(1 - \frac{k}{m}\Bigr) = \frac{Q(m)}{m},$$

where Q(m) is defined in Section 1.2.11.3, Eq. (2). By Eq. (25) in that section,
the probability is approximately $\sqrt{\pi/2m} \approx 1.25/\sqrt{m}$. The chance of Algorithm K
converging as it did is only about one in 80000; the author was decidedly unlucky. But
see exercise 15 for further comments on the “colossalness.”
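
The figure of one in 80000 can be reproduced by summing the series for Q(m) directly. The Python sketch below uses floating point and stops once the terms become negligible; the name `Q` follows the text, while the cutoff constant is an assumption of this illustration.

```python
import math

def Q(m):
    """Q(m) = 1 + (m-1)/m + (m-1)(m-2)/m^2 + ...  (Section 1.2.11.3, Eq. (2))."""
    total, term = 0.0, 1.0
    for k in range(1, m + 1):
        total += term
        term *= (m - k) / m          # next product prod_{j<=k} (1 - j/m)
        if term < 1e-17 * total:     # remaining terms cannot change the sum
            break
    return total

m = 10 ** 10                          # number of ten-digit values in Algorithm K
one_in = m / Q(m)                     # reciprocal of the probability that lambda = 1
```

For m = 10^10 the value of `one_in` comes out a little below 80000, consistent with the estimate $1.25/\sqrt{m}$.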


12. $$\sum_{\substack{1\le\lambda\le m\\ 0\le\mu<m}}\lambda P(\mu,\lambda) = \frac{1}{m}\Bigl(1 + 3\Bigl(1 - \frac1m\Bigr) + 6\Bigl(1 - \frac1m\Bigr)\Bigl(1 - \frac2m\Bigr) + \cdots\Bigr) = \frac{1 + Q(m)}{2}.$$

(See the previous answer. In general if $f(a_0, a_1, \ldots) = \sum_{n\ge0}a_n\prod_{k=1}^{n}(1 - k/m)$ then
$f(a_0, a_1, \ldots) = a_0 + f(a_1, a_2, \ldots) - f(a_1, 2a_2, \ldots)/m$; apply this identity with $a_n =
(n + 1)/2$.) Therefore the average value of λ (and, by symmetry of P(μ, λ), also of
μ + 1) is approximately $\sqrt{\pi m/8} + \frac13$. The average value of μ + λ is exactly Q(m),
approximately $\sqrt{\pi m/2} - \frac13$. [For alternative derivations and further results, including
asymptotic values for the moments, see A. Rapoport, Bull. Math. Biophysics 10 (1948),
145-157, and B. Harris, Annals Math. Stat. 31 (1960), 1045-1062; see also I. M.
Sobol, Theory of Probability and Its Applications 9 (1964), 333-338. Sobol discusses
the asymptotic period length for the more general sequence $X_{n+1} = f(X_n)$ if n ≢ 0
(modulo m), $X_{n+1} = g(X_n)$ if n ≡ 0 (modulo m), with both f and g random.]


13. [Paul Purdom and John Williams, Trans. Amer. Math. Soc. 133 (1968), 547-551.]
Let $T_{mn}$ be the number of functions that have n one-cycles and no cycles of length
greater than one. Then

$$T_{mn} = \binom{m-1}{n-1}m^{m-n}.$$

(This is $\binom{m}{n}r(m, m-n)$ in exercise 2.3.4.4-25.) Any function is such a function followed
by a permutation of the n elements that were the one-cycles. Hence $\sum_{n\ge1}T_{mn}\,n! =
m^m$.

Let $P_{nk}$ be the number of permutations of n elements in which the longest cycle
is of length k. Then the number of functions with a maximum cycle of length k is
$\sum_{n\ge1}T_{mn}P_{nk}$. To get the average value of k, we compute $\sum_{k\ge1}\sum_{n\ge1}kT_{mn}P_{nk}$, which
by the result of exercise 1.3.3-23 is $\sum_{n\ge1}T_{mn}\,n!\,(cn + \frac12c + O(n^{-1}))$ where c ≈ .62433.
Summing, we get the average value $cQ(m) + \frac12c + O(m^{-1/2})$. (This is not substantially
larger than the average value when $X_0$ is selected at random. The average value of
max μ is asymptotic to Q(m) ln 4, and the average value of max(μ + λ) is asymptotic
to 1.9268Q(m); see Flajolet and Odlyzko, Lecture Notes in Comp. Sci. 434 (1990),
329-354.)

14. Let $c_r(m)$ be the number of functions with exactly r different final cycles. From
the recurrence $c_1(m) = (m-1)! - \sum_{k>0}\binom{m}{k}(-1)^k(m-k)^kc_1(m-k)$, which comes
by counting the number of functions whose image contains at most m − k elements,
we find the solution $c_1(m) = m^{m-1}Q(m)$. (See exercise 1.2.11.3-16.) Another way
to obtain the value of $c_1(m)$, which is perhaps more elegant and revealing, is given in
exercise 2.3.4.4-17. The value of $c_r(m)$ may be determined as in exercise 13:

$$c_r(m) = \sum_{n\ge1}T_{mn}{n \brack r} = m^{m-1}\Bigl(\frac{1}{0!}{1 \brack r} + \frac{1}{1!}{2 \brack r}\frac{m-1}{m} + \frac{1}{2!}{3 \brack r}\frac{m-1}{m}\,\frac{m-2}{m} + \cdots\Bigr).$$

The desired average value can now be computed; it is (see exercise 12)

$$E_m = \frac{1}{m}\Bigl(H_1 + 2H_2\frac{m-1}{m} + 3H_3\frac{m-1}{m}\,\frac{m-2}{m} + \cdots\Bigr) = 1 + \frac12\,\frac{m-1}{m} + \frac13\,\frac{m-1}{m}\,\frac{m-2}{m} + \cdots.$$

This latter formula was obtained by quite different means by Martin D. Kruskal, AMM
61 (1954), 392-397. Using the integral representation

$$E_m = \int_0^\infty\Bigl(\Bigl(1 + \frac{x}{m}\Bigr)^m - 1\Bigr)e^{-x}\,\frac{dx}{x},$$

he proved the asymptotic relation $\lim_{m\to\infty}(E_m - \frac12\ln m) = \frac12(\gamma + \ln 2)$. For further
results and references, see John Riordan, Annals Math. Stat. 33 (1962), 178-185.


15. The probability that f(x) ≠ x for all x is $(m-1)^m/m^m$, which is approximately 1/e.
The existence of a self-repeating value in an algorithm like Algorithm K is therefore
not “colossal” at all; it occurs with probability 1 − 1/e ≈ .63212. The only “colossal”
thing was that the author happened to hit such a value when Xo was chosen at random 
(see exercise 11). 


16. The sequence will repeat when a pair of successive elements occurs for the second 
time. The maximum period is m?. (See the next exercise.) 


17. After selecting $X_0, \ldots, X_{k-1}$ arbitrarily, let $X_{n+1} = f(X_n, \ldots, X_{n-k+1})$, where
$0 \le x_1, \ldots, x_k < m$ implies that $0 \le f(x_1, \ldots, x_k) < m$. The maximum period is
$m^k$. This is an obvious upper bound, but it is not obvious that it can be attained; for
constructive proofs that it can always be attained for suitable f, see exercises 3.2.2-17
and 3.2.2-21, and for the number of ways to attain it see exercise 2.3.4.2-23.


18. Same as exercise 7, but use the k-tuple of elements $(X_n, \ldots, X_{n-k+1})$ in place of
the single element $X_n$.


19. Clearly Pr(no final cycle has length 1) = $(m-1)^m/m^m$. R. Pemantle [J. Algorithms
54 (2005), 72-84] has shown that $\Pr(\lambda = 1) = \Theta(m^{-k/2})$, and that $\Pr((\mu + \lambda)^2 > 2m^kx$
and $\lambda/(\mu + \lambda) \le y)$ rapidly approaches $ye^{-x}$, when x > 0, 0 < y < 1, and m → ∞.
The k-dimensional analogs of exercises 13 and 14 remain unsolved.


20. It suffices to consider the simpler mapping g(X) defined by steps K2-K13. Work- 
ing backward from 6065038420, we obtain a total of 597 solutions; the smallest is 
0009612809 and the largest is 9995371004. 


21. We may work with g(X) as in the previous exercise, but now we want to run the 
function forward instead of backward. There is an interesting tradeoff between time 
and space. Notice that the mechanism of step K1 tends to make the period length 
small. So does the existence of X’s with large in-degree; for example, 512 choices of 
X = x6*x*xxxk*** in step K2 will go to K10 with X «+ 0500000000. 

Scott Fluhrer has discovered another fixed point of Algorithm K, namely the value
5008502835(!). He also found the 3-cycle 0225923640 → 2811514413 → 0590051662 →
0225923640, making a total of seven cycles in all. Only 128 starting numbers lead to
the repeating value 5008502835. Algorithm K is a terrible random number generator.


22. If f were truly random, this would be ideal; but how do we construct such f? The 
function defined by Algorithm K would work much better under this scheme, although 
it does have decidedly nonrandom properties (see the previous answer). 


23. The function f permutes its cyclic elements; let $(x_0, \ldots, x_{k-1})$ be the “unusual”
representation of the inverse of that permutation. Then proceed to define $x_k, \ldots, x_{m-1}$
as in exercise 2.3.4.4-18. [See J. Combinatorial Theory 8 (1970), 361-375.]

For example, if m = 10 and (f(0), ..., f(9)) = (3, 1, 4, 1, 5, 9, 2, 6, 5, 4), we have
$(x_0, \ldots, x_9)$ = (4, 9, 5, 1, 1, 3, 4, 2, 6, 5); if $(x_0, \ldots, x_9)$ = (3, 1, 4, 1, 5, 9, 2, 6, 5, 4), we have
(f(0), ..., f(9)) = (6, 4, 9, 3, 1, 1, 2, 5, 4, 5).


SECTION 3.2.1 
1. Take $X_0$ even, a even, c odd. Then $X_n$ is odd for n > 0.


2. Let $X_r$ be the first repeated value in the sequence. If $X_r$ were equal to $X_k$ for some
k where 0 < k < r, we could prove that $X_{r-1} = X_{k-1}$, since $X_n$ uniquely determines
$X_{n-1}$ when a is prime to m. Hence k = 0.


3. If d is the greatest common divisor of a and m, the quantity $aX_n$ can take on at
most m/d values. The situation can be even worse; for example, if $m = 2^e$ and if a is
even, Eq. (6) shows that the sequence is eventually constant.


4. Induction on k. 
5. If a is relatively prime to m, there is a number a′ for which $aa' \equiv 1$ (modulo m). 
Then $X_{n-1} = (a'X_n - a'c) \bmod m$; and in general, if b = a − 1, 

$$X_{n-k} = \bigl((a')^k X_n - c(a' + \cdots + (a')^k)\bigr) \bmod m = \bigl((a')^k X_n + ((a')^k - 1)c/b\bigr) \bmod m$$

when k ≥ 0, n − k ≥ 0. If a is not relatively prime to m, it is not possible to determine 
$X_{n-1}$ when $X_n$ is given; multiples of m/gcd(a,m) may be added to $X_{n-1}$ without 
changing the value of $X_n$. (See also exercise 3.2.1.3-7.) 
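
The backward step in answer 5 can be tried out numerically. The following is a minimal Python sketch, not part of the original answer; the function names and the sample parameters are illustrative assumptions, and `pow(a, -1, m)` (Python 3.8 or later) supplies the inverse a′.

```python
def lcg_forward(x, a, c, m, k=1):
    """Apply X <- (a*X + c) mod m, k times."""
    for _ in range(k):
        x = (a * x + c) % m
    return x

def lcg_backward(x, a, c, m, k=1):
    """Undo k steps using X_{n-1} = (a'X_n - a'c) mod m, where a*a' = 1 (mod m)."""
    a_inv = pow(a, -1, m)  # raises ValueError when gcd(a, m) != 1
    for _ in range(k):
        x = (a_inv * (x - c)) % m
    return x

def lcg_backward_closed(x, a, c, m, k):
    """Closed form: ((a')^k X_n - c(a' + ... + (a')^k)) mod m."""
    a_inv = pow(a, -1, m)
    geometric = sum(pow(a_inv, j, m) for j in range(1, k + 1))
    return (pow(a_inv, k, m) * x - c * geometric) % m
```

When a and m share a factor, `pow` raises `ValueError`, which mirrors the remark that $X_{n-1}$ is then not determined by $X_n$.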


SECTION 3.2.1.1 


1. Let c′ be a solution to the congruence $ac' \equiv c$ (modulo m). (Thus, $c' = a'c \bmod m$, 
if a′ is the number in the answer to exercise 3.2.1-5.) Then we have 


LDA X; ADD CPRIME; MUL A. 


Overflow is possible on this addition operation. (From results derived later in the 
chapter, it is probably best to save a unit of time, taking c = a and replacing the ADD 
instruction by ‘INCA 1’. Then if $X_0 = 0$, overflow will not occur until the end of the 
period, so it won’t occur in practice.) 


2. RANDM STJ  1F 
         LDA  XRAND 
         MUL  2F 
         SLAX 5 
         ADD  3F      (or, INCA c, if c is small) 
         STA  XRAND 
   1H    JNOV * 
         JMP  *-1 
   XRAND CON  X0 
   2H    CON  a 
   3H    CON  c    ▮ 

3. Let $a' = aw \bmod m$, and let m′ be such that $mm' \equiv 1$ (modulo w). Set y ← 
lomult(a′, x), z ← himult(a′, x), t ← lomult(m′, y), u ← himult(m, t). Then we have 
$mt \equiv a'x$ (modulo w), hence $a'x - mt = (z - u)w$, hence $ax \equiv z - u$ (modulo m); it 
follows that $ax \bmod m = z - u + [z < u]\,m$. 
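
The four multiplications of answer 3 can be sketched as follows. This is an illustrative Python rendering only; the word size w = 2^32, the helper names, and the closure-based interface are assumptions made for the example, not part of the original answer.

```python
W = 1 << 32  # assumed word size w

def lomult(u, v):
    """Lower half of the double-word product u*v."""
    return (u * v) % W

def himult(u, v):
    """Upper half of the double-word product u*v."""
    return (u * v) // W

def make_mulmod(a, m):
    """Return a function x -> a*x mod m for odd m < w and 0 <= x < m."""
    a1 = (a * W) % m       # a' = a*w mod m
    m1 = pow(m, -1, W)     # m' with m*m' = 1 (mod w); needs gcd(m, w) = 1
    def mulmod(x):
        y, z = lomult(a1, x), himult(a1, x)
        t = lomult(m1, y)
        u = himult(m, t)
        return z - u + (m if z < u else 0)
    return mulmod
```

The constants a′ and m′ are computed once per multiplier, so each call costs four single-word multiplications and no division.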


4. Define the operation x mod $2^e$ = y if and only if $x \equiv y$ (modulo $2^e$) and 
$-2^{e-1} \le y < 2^{e-1}$. The congruential sequence ⟨$Y_n$⟩ defined by 

$$Y_0 = X_0 \bmod 2^{32}, \qquad Y_{n+1} = (aY_n + c) \bmod 2^{32}$$

is easy to compute on 370-style machines, since the lower half of the product of y 
and z is (yz) mod $2^{32}$ for all two’s complement numbers y and z, and since addition 
ignoring overflow also delivers its result mod $2^{32}$. This sequence has all the 
randomness properties of the standard linear congruential sequence ⟨$X_n$⟩, since $Y_n \equiv X_n$ 
(modulo $2^{32}$). Indeed, the two’s complement representation of $Y_n$ is identical to the 
binary representation of $X_n$, for all n. [G. Marsaglia and T. A. Bray first pointed this 
out in CACM 11 (1968), 757-759.] 


5. (a) Subtraction: LDA X; SUB Y; JANN *+2; ADD M. (b) Addition: LDA X; SUB M; 
ADD Y; JANN *+2; ADD M. (Note that if m is more than half the word size, the 
instruction ‘SUB M’ must precede the instruction ‘ADD Y’.) 


6. The sequences are not essentially different, since adding the constant (m — c) has 
the same effect as subtracting the constant c. The operation must be combined with 
multiplication, so a subtractive process has little merit over the additive one (at least 
in MIX’s case), except when it is necessary to avoid affecting the overflow toggle. 


7. The prime factors of $z^k - 1$ appear in the factorization of $z^{kr} - 1$. If r is odd, 
the prime factors of $z^k + 1$ appear in the factorization of $z^{kr} + 1$. And $z^{2k} - 1$ equals 
$(z^k - 1)(z^k + 1)$. 

8.       JOV  *+1     (Ensure that overflow is off.) 
         LDA  X 
         MUL  A 
         STX  TEMP 
         ADD  TEMP    Add lower half to upper half. 
         JNOV *+2     If ≥ w, subtract w − 1. 
         INCA 1       (Overflow is impossible in this step.)    ▮ 


Note: Since addition on an e-bit ones’-complement computer is mod ($2^e - 1$), it is 
possible to combine the techniques of exercises 4 and 8, producing (yz) mod ($2^e - 1$) 
by adding together the two e-bit halves of the product yz, for all ones’ complement 
numbers y and z regardless of sign. 


9. (a) Both sides equal $aq\lfloor x/q\rfloor$. 

(b) Set $t \leftarrow a(x \bmod q) - r\lfloor x/q\rfloor$, where r = m mod a; the constants q and r can 
be precomputed. Then $ax \bmod m = t + [t<0]\,m$, because we can prove that t > −m: 
Clearly $a(x \bmod q) \le a(q - 1) < m$. Also $r\lfloor x/q\rfloor \le r\lfloor (m-1)/q\rfloor = r\lfloor a + (r-1)/q\rfloor = 
ra \le qa < m$ if 0 < r ≤ q; and $a^2 \le m$ implies r < a ≤ q. [This technique is implicit in 
a program published by B. A. Wichmann and I. D. Hill, Applied Stat. 31 (1982), 190.] 
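
Method 9(b) is short enough to sketch directly. The Python below is an illustration under the assumption that q = ⌊m/a⌋ and r = m mod a, so that m = aq + r; the function name is hypothetical, and the explicit check of r ≤ q reflects the validity condition discussed in the next answer.

```python
def mulmod_small(a, x, m):
    """Compute a*x mod m for 0 <= x < m, keeping every intermediate below m in magnitude."""
    q, r = divmod(m, a)            # m = a*q + r
    if r > q:
        raise ValueError("multiplier is not 'lucky': r > q")
    t = a * (x % q) - r * (x // q)  # -m < t < m
    return t + m if t < 0 else t
```

For m = 2^31 − 1 and a = 16807 this gives q = 127773 and r = 2836, so the method applies.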
10. If r > q and x = m − 1 we have $r\lfloor x/q\rfloor \ge (q+1)(a+1) > m$. So the condition r ≤ q 
is necessary and sufficient for method 9(b) to be valid; this means $m/q - 1 \le a \le m/q$. Let 
$t = \lfloor\sqrt{m}\,\rfloor$. The intervals $[m/q - 1 \,..\, m/q]$ are disjoint for 1 ≤ q ≤ t, and they include exactly 
1 or 2 integers, depending on whether q is a divisor of m. These intervals account for 
all solutions with $a > \sqrt{m}$; they also include the case a = t, if $(\sqrt{m} \bmod 1) < \frac12$, and 
the case a = t − 1 if $m = t^2$. Thus the total number of “lucky” multipliers is exactly 
$2\lfloor\sqrt{m}\,\rfloor + \lfloor d(m)/2\rfloor - [(\sqrt{m} \bmod 1) < \frac12] - 1$, where d(m) is the number of divisors of m. 


11. We can assume that $a \le \frac12 m$; otherwise we can obtain ax mod m from (m − a)x 
mod m. Then we can represent a = a′a″ − a‴, where a′, a″, and a‴ are all less than 
$\sqrt{m}$; for example, we can take $a' \approx \sqrt{m} - 1$ and $a'' = \lceil a/a'\rceil$. It follows that ax mod m 
is (a′(a″x mod m) mod m − (a‴x mod m)) mod m, and the inner three operations can 
all be handled by exercise 9. 

When $m = 2^{31} - 1$ we can take advantage of the fact that m − 1 has 192 divisors 
to find cases in which m = q′a′ + 1, simplifying the general method because r′ = 1. It 
turns out that 86 of these divisors lead to lucky a″ and a‴, when a = 62089911; the 
best such case is probably a′ = 3641, a″ = 17053, a‴ = 62, because 3641 and 62 both 
divide m − 1. This decomposition yields the scheme 

    t ← 17053(x mod 125929) − 16410⌊x/125929⌋, 
    t ← 3641(t mod 589806) − ⌊t/589806⌋, 
    t ← t − (62(x mod 34636833) − ⌊x/34636833⌋), 

where “−” denotes subtraction mod m. The mod operations count as one multiplication 
and one subtraction, because x mod q = x − q⌊x/q⌋ and the operation ⌊x/q⌋ has already 
been done; thus, we have performed seven multiplications, three divisions, and seven 
subtractions. But it’s even better to notice that 62089911 itself has 24 divisors; they 
lead to five suitable factorizations with a‴ = 0. For example, when a′ = 883 and 
a″ = 70317 we need only six multiplications, two divisions, four subtractions: 

    t ← 883(x mod 2432031) − 274⌊x/2432031⌋, 
    t ← 70317(t mod 30540) − 2467⌊t/30540⌋. 

[Can the worst-case number of multiplications plus divisions be reduced to at most 11, 
for all a and m, or is 12 the best upper bound? Another way to achieve 12 appears in 
exercise 4.3.3-19.] 
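
Both decompositions quoted in answer 11 can be checked mechanically. The sketch below is illustrative Python; the helper `step` and the two function names are assumptions, and each intermediate t is normalized into [0, m) before it is reused, which is how “subtraction mod m” is interpreted here.

```python
M31 = 2**31 - 1

def step(a, x, m=M31):
    """One application of method 9(b): a*x mod m with q = m // a, r = m % a."""
    q, r = divmod(m, a)
    t = a * (x % q) - r * (x // q)
    return t + m if t < 0 else t

def mul_three_factor(x):
    """a = 3641*17053 - 62 = 62089911, using three lucky multipliers."""
    t = step(17053, x)
    t = step(3641, t)
    return (t - step(62, x)) % M31

def mul_two_factor(x):
    """a = 883*70317 = 62089911, using two lucky multipliers."""
    return step(70317, step(883, x))
```

The quotients and remainders produced by `divmod` agree with the constants printed in the schemes (for example 125929 and 16410 for the multiplier 17053).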


12. (a) Let $m = 9999998999 = 10^{10} - 10^3 - 1$. To multiply $(x_9x_8\ldots x_0)_{10}$ by 10 
modulo m, use the fact that $10^{10}x_9 \equiv 10^3x_9 + x_9$: Add $(x_9000)_{10}$ to $(x_8x_7\ldots x_0x_9)_{10}$. 
And to avoid circular shifting, imagine that the digits are arranged on a wheel: Just 
add the high-order digit $x_9$ to the digit $x_2$ three positions left, and point to $x_8$ as 
the new high-order digit. If $x_9 + x_2 \ge 10$, a carry propagates to the left. And if 
this carry ripples all the way to the left of $x_8$, it propagates not only to $x_9$ but also 
to the $x_2$ position; it may continue to propagate from both $x_9$ and $x_2$ before finally 
settling down. (The numbers might also become slightly larger than m. For example, 
0999999900 goes to 9999999000 = m + 1, which goes to 9999999009 = m + 10. But a 
redundant representation isn’t necessarily harmful.) 

(b) This is the operation of dividing by 10, so we do the opposite of (a): Move 
the high-order digit pointer cyclically left, and subtract the new high-order digit from 
the digit three to its left. If the result of subtraction is negative, “borrow” in the usual 
fashion (Algorithm 4.3.1S); that is, decrease the preceding digit by 1. Borrowing may 
propagate as in (a), but never past the high-order digit position. This operation keeps 
the numbers nonnegative and less than m. (Thus, division by 10 turns out to be easier 
than multiplication by 10.) 

(c) We can remember the borrow-bit instead of propagating it, because it can be 
incorporated into the subtraction on the next step. Thus, if we define digits $x_n$ and 
borrow-bits $b_n$ by the recurrence 

$$x_n = (x_{n-10} - x_{n-3} - b_n) \bmod 10 = x_{n-10} - x_{n-3} - b_n + 10b_{n+1},$$

we have $999999900^n \bmod 9999998999 = X_n$ by induction on n, where 

$$X_n = (x_{n-1}x_{n-2}x_{n-3}x_{n-4}x_{n-5}x_{n-6}x_{n-7}x_{n+2}x_{n+1}x_n)_{10} - 1000b_{n+3}$$
$$\phantom{X_n} = (x_{n-1}x_{n-2}\ldots x_{n-10})_{10} - (x_{n-1}x_{n-2}x_{n-3})_{10} - b_n,$$

provided that the initial conditions are set up to make $X_0 = 1$. Notice that $10X_{n+1} = 
(x_nx_{n-1}x_{n-2}x_{n-3}x_{n-4}x_{n-5}x_{n-6}x_{n+3}x_{n+2}x_{n+1}0)_{10} - 10000b_{n+4} = mx_n + X_n$; it follows 
that $0 \le X_n < m$ for all n ≥ 0. 

(d) If 0 ≤ U < m, the first digit of the decimal representation of U/m is ⌊10U/m⌋, 
and the subsequent digits are the decimal representation of (10U mod m)/m; see, for 
example, Method 2a in Section 4.4. Thus $U/m = (.u_1u_2\ldots)_{10}$ if we set $U_0 = U$ and 
$U_n = 10U_{n-1} \bmod m = 10U_{n-1} - mu_n$. Informally, the digits of 1/m are the leading 
digits of $10^n \bmod m$ for n = 1, 2, ..., a sequence that is eventually periodic; these are 
the leading digits of $10^{-n} \bmod m$ in reverse order, so we have calculated them in (c). 

A rigorous proof is, of course, preferable to handwaving. Let λ be the least positive 
integer with $10^\lambda \equiv 1$ (modulo m), and define $x_n = x_{n \bmod \lambda}$, $b_n = b_{n \bmod \lambda}$, $X_n = 
X_{n \bmod \lambda}$ for all n < 0. Then the recurrences for $x_n$, $b_n$, and $X_n$ in (c) are valid for all 
integers n. If $U_0 = 1$ it follows that $U_n = X_{-n}$ and $u_n = x_{-n}$; hence 

$$\frac{999999900^n \bmod 9999998999}{9999998999} = (.x_{n-1}x_{n-2}x_{n-3}\ldots)_{10}.$$

(e) Let w be the computer’s word size, and use the recurrence 

$$x_n = (x_{n-k} - x_{n-l} - b_n) \bmod w = x_{n-k} - x_{n-l} - b_n + wb_{n+1},$$

where 0 < l < k and k is large. Then $(.x_{n-1}x_{n-2}x_{n-3}\ldots)_w = X_n/m$, where $m = 
w^k - w^l - 1$ and $X_{n+1} = (w^{k-1} - w^{l-1})X_n \bmod m$. The relation 

$$X_n = (x_{n-1}\ldots x_{n-k})_w - (x_{n-1}\ldots x_{n-l})_w - b_n$$

holds for n ≥ 0; the values of $x_{-1}, \ldots, x_{-k}$, and $b_0$ should be such that $0 \le X_0 < m$. 
Such random number generators, and the similar ones in the following exercise, 
were introduced by G. Marsaglia and A. Zaman [Annals of Applied Probability 1 
(1991), 462-480], who called the method subtract-with-borrow. Their starting point 
was the radix-w representation of fractions with denominator m. The relation to linear 
congruential sequences was noticed by Shu Tezuka, and analyzed in detail by Tezuka, 
L’Ecuyer, and Couture [ACM Trans. Modeling and Computer Simulation 3 (1993), 
315-331]. The period length is discussed in exercise 3.2.1.2—22. 
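
The decimal recurrence of part (c) is small enough to simulate. The following Python sketch is an illustration only; the names `big_x` and `swb_states`, the list layout of the digit history, and the seed are assumptions chosen so that $X_0 = 1$.

```python
from itertools import islice

M = 10**10 - 10**3 - 1   # 9999998999
INV10 = 10**9 - 10**2    # 999999900, since 10 * 999999900 = 1 (mod M)

def big_x(hist, borrow):
    """X_n = (x_{n-1}...x_{n-10})_10 - (x_{n-1}x_{n-2}x_{n-3})_10 - b_n,
    where hist = [x_{n-1}, x_{n-2}, ..., x_{n-10}]."""
    full = int("".join(map(str, hist)))
    top3 = int("".join(map(str, hist[:3])))
    return full - top3 - borrow

def swb_states(seed, borrow=0):
    """Yield X_0, X_1, ... for the subtract-with-borrow digit recurrence."""
    hist, b = list(seed), borrow
    while True:
        yield big_x(hist, b)
        t = hist[9] - hist[2] - b        # x_{n-10} - x_{n-3} - b_n
        b = 1 if t < 0 else 0            # b_{n+1}
        hist = [t + 10 * b] + hist[:9]   # new digit x_n moves to the front

first_states = list(islice(swb_states([0] * 9 + [1]), 5))
```

With the seed shown, each state should equal the corresponding power of 999999900 modulo m, even though the simulation itself only subtracts single digits.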
13. Multiplication by 10 now requires negating the digit that is added. For this 
purpose it is convenient to represent a number with its last three digits negated; 
for example, $9876543210 = (9876544\bar7\bar9\bar0)_{10}$. Then 10 times $(x_9\ldots x_3\bar x_2\bar x_1\bar x_0)_{10}$ is 
$(x_8\ldots x_3x'\bar x_1\bar x_0\bar x_9)_{10}$ where $x' = x_9 - x_2$. Similarly, $(x_9\ldots x_3\bar x_2\bar x_1\bar x_0)_{10}$ divided by 10 
is $(x_0x_9\ldots x_4\bar x''\bar x_2\bar x_1)_{10}$ where $x'' = x_0 - x_3$. The recurrence 

$$x_n = (x_{n-3} - x_{n-10} - b_n) \bmod 10 = x_{n-3} - x_{n-10} - b_n + 10b_{n+1}$$

yields $8999999101^n \bmod 9999999001 = X_n$ where 

$$X_n = (x_{n-1}x_{n-2}x_{n-3}x_{n-4}x_{n-5}x_{n-6}x_{n-7}\bar x_{n+2}\bar x_{n+1}\bar x_n)_{10} + 1000b_{n+3}$$
$$\phantom{X_n} = (x_{n-1}x_{n-2}\ldots x_{n-10})_{10} - (x_{n-1}x_{n-2}x_{n-3})_{10} + b_n.$$

When the radix is generalized from 10 to w, we find that the inverse powers of w 
modulo $w^k - w^l + 1$ are generated by 

$$x_n = (x_{n-l} - x_{n-k} - b_n) \bmod w = x_{n-l} - x_{n-k} - b_n + wb_{n+1}$$

(the same as in exercise 12 but with k and l interchanged). 


14. Mild generalization: We can effectively divide by b modulo $b^k - b^l \pm 1$ for any b 
less than or equal to the word size w, since the recurrence for $x_n$ is almost as efficient 
when b < w as it is when b = w. 

Strong generalization: The recurrence 

$$x_n = (a_1x_{n-1} + \cdots + a_kx_{n-k} + c_n) \bmod b, \qquad c_{n+1} = \left\lfloor \frac{a_1x_{n-1} + \cdots + a_kx_{n-k} + c_n}{b} \right\rfloor$$

is equivalent to $X_n = b^{-1}X_{n-1} \bmod |m|$ in the sense that $X_n/|m| = (.x_{n-1}x_{n-2}\ldots)_b$, 
if we define 

$$m = a_kb^k + \cdots + a_1b - 1 \quad\text{and}\quad X_n = \Bigl(\sum_{j=1}^{k} a_j(x_{n-1}\ldots x_{n-j})_b + c_n\Bigr)(\operatorname{sign} m).$$

The initial values $x_{-1}\ldots x_{-k}$ and $c_0$ should be selected so that $0 \le X_0 < |m|$; we will 
then have $x_n = (bX_{n+1} - X_n)/|m|$ for n ≥ 0. The values of $x_j$ for j < 0 that appear 
in the formula $X_n/|m| = (.x_{n-1}x_{n-2}\ldots)_b$ are properly regarded as $x_{j \bmod \lambda}$, where 
$b^\lambda \equiv 1$ (modulo m); these values may differ from the numbers $x_{-1}, \ldots, x_{-k}$ that were 
initially supplied. The carry digits $c_n$ will satisfy 

$$\sum_{j=1}^{k} \min(0, a_j) \le c_n < \sum_{j=1}^{k} \max(0, a_j)$$

if the initial carry $c_0$ is in this range. 

The special case $m = b^k + b^l - 1$, for which $a_j = \delta_{jl} + \delta_{jk}$, is of particular interest 
because it can be computed so easily; Marsaglia and Zaman called this the add-with-carry 
generator: 

$$x_n = (x_{n-l} + x_{n-k} + c_n) \bmod b = x_{n-l} + x_{n-k} + c_n - b\,c_{n+1}.$$

Another potentially attractive possibility is to use k = 2 in a generator with, say, 
$b = 2^{31}$ and $m = 65430b^2 + b - 1$. This modulus m is prime, and the period length 
turns out to be (m − 1)/2. The spectral test of Section 3.3.4 indicates that the spacing 
between planes is good (large ν values), although of course the multiplier $b^{-1}$ is poor 
in comparison with other multipliers for this particular modulus m. 

Exercise 3.2.1.2-22 contains additional information about subtract-with-borrow 
and add-with-carry moduli that lead to extremely long periods. 


SECTION 3.2.1.2 
1. Period length m, by Theorem A. (See exercise 3.) 


2. Yes, these conditions imply the conditions in Theorem A, since the only prime 
divisor of $2^e$ is 2, and any odd number is relatively prime to $2^e$. (In fact, the conditions 
of the exercise are necessary and sufficient.) 


3. By Theorem A, we need $a \equiv 1$ (modulo 4) and $a \equiv 1$ (modulo 5). By Law D of 
Section 1.2.4, this is equivalent to $a \equiv 1$ (modulo 20). 
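
The condition in answer 3 can be confirmed by brute force on a small decimal modulus. The Python sketch below is an illustration; the modulus m = 100, the increment c = 1, and the function name are assumptions chosen for the example.

```python
def has_full_period(a, c, m):
    """True when X <- (a*X + c) mod m, started at 0, visits all m residues
    and then returns to 0."""
    x, seen = 0, set()
    for _ in range(m):
        if x in seen:
            return False
        seen.add(x)
        x = (a * x + c) % m
    return x == 0

full_period_multipliers = [a for a in range(100) if has_full_period(a, 1, 100)]
```

For m = 100 the surviving multipliers should be exactly those congruent to 1 modulo 20.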
4. We know $X_{2^{e-1}} \equiv 0$ (modulo $2^{e-1}$) by using Theorem A in the case $m = 2^{e-1}$. 
Also using Theorem A for $m = 2^e$, we know that $X_{2^{e-1}} \not\equiv 0$ (modulo $2^e$). It follows 
that $X_{2^{e-1}} = 2^{e-1}$. More generally, we can use Eq. 3.2.1-(6) to prove that the second 
half of the period is essentially like the first half, since $X_{n+2^{e-1}} = (X_n + 2^{e-1}) \bmod 2^e$. 
(The quarters are similar too, see exercise 21.) 

5. We need $a \equiv 1$ (modulo p) for p = 3, 11, 43, 281, 86171. By Law D of Section 1.2.4, 
this is equivalent to $a \equiv 1$ (modulo 3 · 11 · 43 · 281 · 86171), so the only solution is the 
terrible multiplier a = 1. 

6. (See the previous exercise.) The congruence $a \equiv 1$ (modulo 3 · 7 · 11 · 13 · 37) 
implies that the solutions are a = 1 + 111111k, for 0 ≤ k ≤ 8. 

7. Using the notation of the proof of Lemma Q, μ is the smallest value such that 
$X_{\mu+\lambda} = X_\mu$; so it is the smallest value such that $Y_{\mu+\lambda} = Y_\mu$ and $Z_{\mu+\lambda} = Z_\mu$. This 
shows that $\mu = \max(\mu_1, \ldots, \mu_t)$. The highest achievable μ is $\max(e_1, \ldots, e_t)$, but 
nobody really wants to achieve it. 

8. We have $a^2 \equiv 1$ (modulo 8); so $a^4 \equiv 1$ (modulo 16), $a^8 \equiv 1$ (modulo 32), etc. If 
a mod 4 = 3, then a − 1 is twice an odd number; so $(a^{2^{e-1}} - 1)/(a - 1) \equiv 0$ (modulo $2^e$) 
if and only if $(a^{2^{e-1}} - 1)/2 \equiv 0$ (modulo $2^{e+1}/2$), which is true. 

9. Substitute for $X_n$ in terms of $Y_n$ and simplify. If $X_0$ mod 4 = 3, the formulas 
of the exercise do not apply; but they do apply to the sequence $Z_n = (-X_n) \bmod 2^e$, 
which has essentially the same behavior. 


10. Only when m = 1, 2, 4, $p^e$, and $2p^e$, for odd primes p. In all other cases, the result 
of Theorem B is an improvement over Euler’s theorem (exercise 1.2.4-28). 


11. (a) Either x + 1 or x − 1 (not both) will be a multiple of 4, so $x \mp 1 = q2^f$, where q 
is odd and f is greater than 1. (b) In the given circumstances, f < e and so e ≥ 3. We 
have $\pm x \equiv 1$ (modulo $2^f$) and $\pm x \not\equiv 1$ (modulo $2^{f+1}$) and f > 1. Hence, by applying 
Lemma P, we find that $(\pm x)^{2^{e-f-1}} \not\equiv 1$ (modulo $2^e$), while $x^{2^{e-f}} = (\pm x)^{2^{e-f}} \equiv 1$ 
(modulo $2^e$). So the order is a divisor of $2^{e-f}$, but not a divisor of $2^{e-f-1}$. (c) 1 has 
order 1; $2^e - 1$ has order 2; the maximum period when e ≥ 3 is therefore $2^{e-2}$, and for 
e ≥ 4 it is necessary to have f = 2, that is, $x \equiv 4 \pm 1$ (modulo 8). 
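
The order formula of answer 11 can be checked exhaustively for a small exponent. This Python sketch is illustrative; the function names and the choice e = 6 are assumptions, and the brute-force `order_mod` is only sensible for small moduli.

```python
def order_mod(x, m):
    """Multiplicative order of x modulo m (x must be coprime to m)."""
    k, y = 1, x % m
    while y != 1:
        y = y * x % m
        k += 1
    return k

def exponent_f(x):
    """The f > 1 for which x - 1 or x + 1 equals q * 2**f with q odd."""
    for y in (x - 1, x + 1):
        if y != 0 and y % 4 == 0:
            f = 0
            while y % 2 == 0:
                y //= 2
                f += 1
            return f
    raise ValueError("x must be odd and different from 1")

E = 6
orders = {x: order_mod(x, 2**E) for x in range(1, 2**E, 2)}
```

For every odd x other than 1 and $2^e - 1$, the computed order should be $2^{e-f}$, and the maximum $2^{e-2}$ should occur exactly when x mod 8 is 3 or 5.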


12. If k is a proper divisor of p − 1 and if $a^k \equiv 1$ (modulo p), then by Lemma P 
we have $a^{kp^{e-1}} \equiv 1$ (modulo $p^e$). Similarly, if $a^{p-1} \equiv 1$ (modulo $p^2$), we find that 
$a^{(p-1)p^{e-2}} \equiv 1$ (modulo $p^e$). So in these cases a is not primitive. Conversely, if $a^{p-1} \not\equiv 1$ 
(modulo $p^2$), Theorem 1.2.4F and Lemma P tell us that $a^{(p-1)p^{e-2}} \not\equiv 1$ (modulo $p^e$), 
but $a^{(p-1)p^{e-1}} \equiv 1$ (modulo $p^e$). So the order is a divisor of $(p - 1)p^{e-1}$ but not of 
$(p - 1)p^{e-2}$; it therefore has the form $kp^{e-1}$, where k divides p − 1. But if a is primitive 
modulo p, the congruence $a^{kp^{e-1}} \equiv a^k \equiv 1$ (modulo p) implies that k = p − 1. 


13. Suppose a mod p ≠ 0, and let λ be the order of a modulo p. By Theorem 1.2.4F, 
λ is a divisor of p − 1. If λ < p − 1, then (p − 1)/λ has a prime factor, q. 

14. Let 0 < k < p. If $a^{p-1} \equiv 1$ (modulo $p^2$), then $(a + kp)^{p-1} \equiv a^{p-1} + (p - 1)a^{p-2}kp$ 
(modulo $p^2$); and this is $\not\equiv 1$, since $(p - 1)a^{p-2}k$ is not a multiple of p. By exercise 12, 
a + kp is primitive modulo $p^e$. 


15. (a) If $\lambda_1 = p_1^{e_1} \ldots p_t^{e_t}$ and $\lambda_2 = p_1^{f_1} \ldots p_t^{f_t}$, let $\kappa_1 = p_1^{g_1} \ldots p_t^{g_t}$ 
and $\kappa_2 = p_1^{h_1} \ldots p_t^{h_t}$, where 

    $g_j = e_j$ and $h_j = 0$,   if $e_j < f_j$, 
    $g_j = 0$ and $h_j = f_j$,   if $e_j \ge f_j$. 

Now $a_1^{\kappa_1}$ and $a_2^{\kappa_2}$ have periods $\lambda_1/\kappa_1$ and $\lambda_2/\kappa_2$, and the latter are relatively prime. 
Furthermore $(\lambda_1/\kappa_1)(\lambda_2/\kappa_2) = \lambda$, so it suffices to consider the case when $\lambda_1$ is relatively 
prime to $\lambda_2$, that is, when $\lambda = \lambda_1\lambda_2$. Now let λ′ be the order of $a_1a_2$. Since $(a_1a_2)^{\lambda'} \equiv 
1$, we have $1 \equiv (a_1a_2)^{\lambda'\lambda_1} \equiv a_2^{\lambda'\lambda_1}$; hence $\lambda'\lambda_1$ is a multiple of $\lambda_2$. This implies that λ′ 
is a multiple of $\lambda_2$, since $\lambda_1$ is relatively prime to $\lambda_2$. Similarly, λ′ is a multiple of $\lambda_1$; 
hence λ′ is a multiple of $\lambda_1\lambda_2$. But obviously $(a_1a_2)^{\lambda_1\lambda_2} \equiv 1$, so $\lambda' = \lambda_1\lambda_2$. 

(b) If $a_1$ has order λ(m) and if $a_2$ has order λ, by part (a) λ(m) must be a multiple 
of λ, otherwise we could find an element of higher order, namely of order lcm(λ, λ(m)). 
16. (a) $f(x) = (x - a)\bigl(x^{n-1} + (a + c_1)x^{n-2} + \cdots + (a^{n-1} + \cdots + c_{n-1})\bigr) + f(a)$. 
(b) The statement is clear when n = 0. If a is one root, $f(x) \equiv (x - a)q(x)$; therefore, 
if a′ is any other root, 

$$0 \equiv f(a') \equiv (a' - a)q(a'),$$

and since a′ − a is not a multiple of p, a′ must be a root of q(x). So if f(x) has more 
than n distinct roots, q(x) has more than n − 1 distinct roots. [J. L. Lagrange, Mém. 
Acad. Roy. Sci. Berlin 24 (1768), 181-250, §10.] (c) λ(p) ≥ p − 1, since f(x) must have 
degree ≥ p − 1 in order to possess so many roots. But λ(p) ≤ p − 1 by Theorem 1.2.4F. 


17. By Lemma P, $11^5 \equiv 1$ (modulo 25), $11^5 \not\equiv 1$ (modulo 125), etc.; so the order of 11 
is $5^{e-1}$ (modulo $5^e$), not the maximum value $\lambda(5^e) = 4 \cdot 5^{e-1}$. But by Lemma Q the 
total period length is the least common multiple of the period modulo $2^e$ (namely $2^{e-2}$) 
and the period modulo $5^e$ (namely $5^{e-1}$), and this is $2^{e-2}5^{e-1} = \lambda(10^e)$. The period 
modulo $5^e$ may be $5^{e-1}$ or $2 \cdot 5^{e-1}$ or $4 \cdot 5^{e-1}$, without affecting the length of period 
modulo $10^e$, since the least common multiple is taken. The values that are primitive 
modulo $5^e$ are those congruent to 2, 3, 8, 12, 13, 17, 22, 23 modulo 25 (see exercise 12), 
namely 3, 13, 27, 37, 53, 67, 77, 83, 117, 123, 133, 147, 163, 173, 187, 197. 
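
The list at the end of answer 17 can be regenerated from the two residue conditions. The Python below is a sketch; it assumes (as the listed values suggest) that the multipliers are also required to satisfy a mod 8 = 3 or 5, and the names used are illustrative.

```python
def order_mod(x, m):
    """Multiplicative order of x modulo m (x must be coprime to m)."""
    k, y = 1, x % m
    while y != 1:
        y = y * x % m
        k += 1
    return k

# Residues modulo 25 whose order is 20, i.e. primitive roots modulo 25.
PRIMITIVE_MOD_25 = sorted(r for r in range(1, 25) if r % 5 and order_mod(r, 25) == 20)

# Values below 200 that are primitive modulo 5^e and congruent to 3 or 5 modulo 8.
candidates = [a for a in range(200)
              if a % 8 in (3, 5) and a % 25 in PRIMITIVE_MOD_25]
```

Each candidate should then have order 100 modulo both 125 and 1000, the latter being λ(1000).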

18. According to Theorem C, a mod 8 must be 3 or 5. Knowing the period of a 
modulo 5 and modulo 25 allows us to apply Lemma P to determine admissible values 
of a mod 25. Period = $4 \cdot 5^{e-1}$: 2, 3, 8, 12, 13, 17, 22, 23; period = $2 \cdot 5^{e-1}$: 4, 9, 
14, 19; period = $5^{e-1}$: 6, 11, 16, 21. Each of these 16 values yields one value of a, 
0 ≤ a < 200, with a mod 8 = 3, and another value of a with a mod 8 = 5. 

19. Several examples appear in lines 17—20 of Table 3.3.4-1. 

20. (a) We have $AY_n + X_0 \equiv AY_{n+k} + X_0$ (modulo m) if and only if $Y_n \equiv Y_{n+k}$ 
(modulo m′). (b)(i) Obvious. (ii) Theorem A. (iii) $(a^n - 1)/(a - 1) \equiv 0$ (modulo $2^e$) 
if and only if $a^n \equiv 1$ (modulo $2^{e+1}$); if $a \not\equiv -1$, the order of a modulo $2^{e+1}$ is twice its 
order modulo $2^e$. (iv) $(a^n - 1)/(a - 1) \equiv 0$ (modulo $p^e$) if and only if $a^n \equiv 1$. 

21. $X_{n+s} \equiv X_n + X_s$ by Eq. 3.2.1-(6); and s is a divisor of m, since s is a power of 
p when m is a power of p. Hence a given integer q is a multiple of m/s if and only if 
$X_{qs} \equiv 0$, if and only if q is a multiple of $m/\gcd(X_s, m)$. 


22. Algorithm 4.5.4P is able to test numbers of the form $m = b^k \pm b^l \pm 1$ for primality in 
a reasonable time when, say, $b \approx 2^{32}$ and $l < k \approx 100$; the calculations should be done 
in radix b so that the special form of m speeds up the operation of squaring mod m. 
(Consider, for example, squaring mod 9999998999 in decimal notation.) Algorithm 
4.5.4P should, of course, be used only when m is known to have no small divisors. 
Marsaglia and Zaman [Annals of Applied Probability 1 (1991), 474-475] showed 
that $m = b^{43} - b^{22} + 1$ is prime with primitive root b when b is the prime number $2^{32} - 5$. 
This required factoring $m - 1 = b^{22}(b - 1)(b^6 + b^5 + b^4 + b^3 + b^2 + b + 1)(b^{14} + b^7 + 1)$ in order 
to establish the primitivity of b; one of the 17 prime factors of m − 1 has 99 decimal 
digits. As a result, we can be sure that the sequence $x_n = (x_{n-22} - x_{n-43} - c_n) \bmod b = 
x_{n-22} - x_{n-43} - c_n + bc_{n+1}$ has period length $m - 1 \approx 10^{414}$ for every nonzero choice 
of seed values $0 \le x_{-1}, \ldots, x_{-43} < b$ when $c_0 = 0$. 

However, 43 is still a rather small value for k from the standpoint of the birthday 
spacings test (see Section 3.3.2J), and 22 is rather near 43/2. Considerations of 
“mixing” indicate that we prefer values of k and l for which the first few partial 
quotients in the continued fraction of l/k are small. To avoid potential problems with 
this generator, it’s a good idea to discard some of the numbers, as recommended by 
Lüscher (see Section 3.2.2). 

Here are some prime numbers of the form b? + b + 1 that satisfy the mixing 
constraint when b = 23? and 50 < k < 100: For subtract-with-borrow, b°” — b! — 1, 
b73 — pl? — 1, 86 — 492 — 1, B88 — b5? — 1, b95 — bêt — 1, b58 — b33 4.1, b? — oI 4.1, 
b9 — 6744.1, b7? — b57 +1, b87 — 441. For add-with-carry, b°° +b??? — 1, 6°! + b*4 — 1, 
BM + b?7 — 1, b°° + 6% — 1. (Less desirable from a mixing standpoint are the primes 
b56 p5 — 1. b56 — p32 — 1. p86 — 957 — 1. p76 — p25 — 1. p84 — 926 — 1. p99 — 94? — 1 
493 _ pis 1; 6°? be +1, 60 bY? 4.1, b7 28 +1, 87 63 +1, b83 — b" f 1; b5 40? 1 
b76 4 btt — 1, b88 4 439 — 1, b? 4 bt 1) 

To calculate the period of the resulting sequences, we need to know the factors 
of m − 1; but this isn’t feasible for such large numbers unless we are extremely lucky. 
Suppose we do succeed in finding the prime factors $q_1, \ldots, q_t$; then the probability that 
$b^{(m-1)/q} \bmod m = 1$ is extremely small, only 1/q, except for the very small primes q. 
Therefore we can be quite confident that the period of $b^n \bmod m$ is extremely long even 
though we cannot factor m − 1. 

Indeed, the period is almost certainly very long even if m is not prime. Consider, 
for example, the case k = 10, l = 3, b = 10 (which is much too small for random 
number generation but small enough that we can easily compute the exact results). In 
this case ⟨$10^n \bmod m$⟩ has period length lcm(219, 11389520) = 2494304880 when m = 
9999998999 = 439 · 22779041; 4999999500 when m = 9999999001; 5000000499 when 
m = 10000000999; and lcm(1, 16, 2686, 12162) = 130668528 when m = 10000001001 = 
3 · 17 · 2687 · 72973. Rare choices of the seed values may shorten the period when m is not 
prime. But we can hardly go wrong if we choose, say, k = 1000, l = 619, and b = 2°. 


SECTION 3.2.1.3 


1. c = 1 is always relatively prime to $B^5$; and every prime dividing $m = B^5$ is a 
divisor of B, so it divides $b = B^2$ to at least the second power. 


2. Only 3, so the generator is not recommended in spite of its long period. 
3. The potency is 18 in both cases (see the next exercise). 


4. Since a mod 4 = 1, we must have a mod 8 = 1 or 5, so b mod 8 = 0 or 4. If b is an 
odd multiple of 4, and if $b_1$ is a multiple of 8, clearly $b^s \equiv 0$ (modulo $2^e$) implies that 
$b_1^s \equiv 0$ (modulo $2^e$), so $b_1$ cannot have higher potency than b. 


5. The potency is the smallest value of s such that $f_js \ge e_j$ for all j. 


6. The modulus must be divisible by $2^7$ or by $p^4$ (for odd prime p) in order to have 
a potency as high as 4. The only values are $m = 2^{27} + 1$ and $10^9 - 1$. 


7. $a' = (1 - b + b^2 - \cdots) \bmod m$, where the terms in $b^s$, $b^{s+1}$, etc., are dropped (if s 
is the potency). 


8. Since $X_n$ is always odd, 

$$X_{n+2} = (2^{34} + 3 \cdot 2^{18} + 9)X_n \bmod 2^{35} = (2^{34} + 6X_{n+1} - 9X_n) \bmod 2^{35}.$$

Given $Y_n$ and $Y_{n+1}$, the possibilities for 

$$Y_{n+2} \approx \bigl(10 + 6(Y_{n+1} + \epsilon_1) - 9(Y_n + \epsilon_2)\bigr) \bmod 20,$$

with $0 \le \epsilon_1 < 1$, $0 \le \epsilon_2 < 1$, are limited and nonrandom. 

Note: If the multiplier suggested in exercise 3 were, say, $2^{33} + 2^{18} + 2^2 + 1$, instead 
of $2^{23} + 2^{13} + 2^2 + 1$, we would similarly find $X_{n+2} - 10X_{n+1} + 25X_n \equiv$ constant 
(modulo $2^{35}$). In general, we do not want a ± δ to be divisible by high powers of 2 
when δ is small, else we get “second-order impotency.” See Section 3.3.4 for a more 
detailed discussion. 

The generator that appears in this exercise is discussed in an article by MacLaren 
and Marsaglia, JACM 12 (1965), 83-89. The deficiencies of such generators were first 
demonstrated by M. Greenberger, CACM 8 (1965), 177-179. Yet generators like this 
were still in widespread use more than ten years later (see the discussion of RANDU in 
Section 3.3.4). 
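
The second-order relation in answer 8 is easy to observe numerically. In the Python sketch below, the multiplier a = 2^17 + 3 with c = 0 and m = 2^35 is an assumption inferred from the displayed value of a² (it is not stated in this answer), and the function names are illustrative.

```python
M35 = 1 << 35
A = (1 << 17) + 3  # assumed multiplier: A*A = 2^34 + 3*2^18 + 9

def orbit(x0, n):
    """Return [X_0, X_1, ..., X_n] for X <- A*X mod 2^35."""
    xs = [x0]
    for _ in range(n):
        xs.append(A * xs[-1] % M35)
    return xs

def obeys_second_order_relation(x0, n=1000):
    """Check X_{i+2} = (2^34 + 6*X_{i+1} - 9*X_i) mod 2^35 for i < n."""
    xs = orbit(x0, n + 2)
    return all(xs[i + 2] == ((1 << 34) + 6 * xs[i + 1] - 9 * xs[i]) % M35
               for i in range(n))
```

The relation depends on $X_n$ being odd, so an even seed should violate it.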


SECTION 3.2.2 


1. The method is useful only with great caution. In the first place, $aU_n$ is likely to be 
so large that the addition of c/m that follows will lose almost all significance, and the 
“mod 1” operation will nearly destroy any vestiges of significance that might remain. 
We conclude that double-precision floating point arithmetic is necessary. Even with 
double precision, one must be sure that no rounding, etc., occurs to affect the numbers 
of the sequence in any way, since that would destroy the theoretical grounds for the 
good behavior of the sequence. (But see exercise 23.) 


2. $X_{n+1}$ equals either $X_{n-1} + X_n$ or $X_{n-1} + X_n - m$. If $X_{n+1} < X_n$ we must have 
$X_{n+1} = X_{n-1} + X_n - m$; hence $X_{n+1} < X_{n-1}$. 


3. (a) The bracketed numbers are V[j] after step M3. 

Output: initial  0   4   5   6   2   0   3  (2   7   4   1   6   3   0   5) and repeats. 
V[0]:      0    [4] [7]  7   7   7   7   7   7  [4] [7]  7   7   7   7   7   7  [4] [7] ... 
V[1]:      3     3   3   3   3   3   3  [2] [5]  5   5   5   5   5   5  [2] [5]  5   5  ... 
V[2]:      2     2   2   2   2  [0] [3]  3   3   3   3   3   3  [0] [3]  3   3   3   3  ... 
V[3]:      5     5   5  [6] [1]  1   1   1   1   1   1  [6] [1]  1   1   1   1   1   1  ... 
X:               4   7   6   1   0   3   2   5   4   7   6   1   0   3   2   5   4   7  ... 
Y:               0   1   6   7   4   5   2   3   0   1   6   7   4   5   2   3   0   1  ... 

So the potency has been reduced to 1! (See further comments in the answer to 
exercise 15.) 
(b) The bracketed numbers are V[j] after step B2. 

Output: initial  2   3   6   5   7   0   0   5   3  ...  4   6  (3   0  ...  4   7) ... 
V[0]:      0     0   0   0   0   0   0  [5] [4]  4  ...  1   1   1   1  ...  1   1  ... 
V[1]:      3     3  [6] [1]  1   1   1   1   1   1  ...  0   0   0  [4] ...  0   0  ... 
V[2]:      2    [7]  7   7   7  [3]  3   3   3  [7] ...  6  [2]  2   2  ...  7  [2] ... 
V[3]:      5     5   5   5  [0]  0  [2]  2   2   2  ... [3]  3  [5]  5  ... [3]  3  ... 
X:         4     7   6   1   0   3   2   5   4   7  ...  3   2   5   4  ...  3   2  ... 

In this case the output is considerably better than the input; it enters a repeating cycle 
of length 40 after 46 steps: 236570 05314 72632 40110 37564 76025 12541 73625 03746 
(30175 24061 52317 46203 74531 60425 16753 02647). The cycle can be found easily by 
applying the method of exercise 3.1-7 to the array above until a column is repeated. 
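
The two tables can be regenerated by simulation. The Python sketch below is illustrative: the generator parameters ($X_{n+1} = (5X_n + 3) \bmod 8$, $Y_{n+1} = (5Y_n + 1) \bmod 8$, both starting at 0, with k = 4) are assumptions inferred from the X and Y rows above, and the function names are hypothetical.

```python
from itertools import islice

def lcg(x, a, c, m):
    """Yield X_0, X_1, ... for X <- (a*X + c) mod m."""
    while True:
        yield x
        x = (a * x + c) % m

def algorithm_m(xs, ys, k, m):
    """Shuffle xs using ys to select the table position."""
    v = [next(xs) for _ in range(k)]
    while True:
        x, y = next(xs), next(ys)
        j = k * y // m
        out, v[j] = v[j], x
        yield out

def algorithm_b(xs, k, m):
    """Shuffle xs using the previous output to select the table position."""
    v = [next(xs) for _ in range(k)]
    y = next(xs)
    while True:
        j = k * y // m
        y, v[j] = v[j], next(xs)
        yield y

out_m = list(islice(algorithm_m(lcg(0, 5, 3, 8), lcg(0, 5, 1, 8), 4, 8), 15))
out_b = list(islice(algorithm_b(lcg(0, 5, 3, 8), 4, 8), 9))
```

Taking more terms of `algorithm_b` is one way to look for the repeating cycle described in part (b).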

4. The low-order byte of many random sequences (e.g., linear congruential sequences 
with m = word size) is much less random than the high-order byte. See Section 3.2.1.1. 

5. The randomizing effect would be quite minimized, because V[j] would always 
contain a number in a certain range, essentially j/k ≤ V[j]/m < (j + 1)/k. However, 
some similar approaches could be used: We could take $Y_n = X_{n-1}$, or we could choose j 
from $X_n$ by extracting some digits from the middle instead of at the extreme left. 
None of these suggestions would produce a lengthening of the period analogous to 
the behavior of Algorithm B. (Exercise 27 shows, however, that Algorithm B doesn’t 
necessarily increase the period length.) 

6. For example, if $X_n/m < \frac12$ then $X_{n+1} = 2X_n$. 

7. [W. Mantel, Nieuw Archief voor Wiskunde (2) 1 (1897), 172-184.] 

                          00...01                  00...01 
    The subsequence of    00...10                    ... 
    X values:               ...       becomes:     10...00 
                          10...00                  00...00 
                          CONTENTS(A)              CONTENTS(A) 

8. We may assume that $X_0 = 0$ and $m = p^e$, as in the proof of Theorem 3.2.1.2A. 
First suppose that the sequence has period length $p^e$; it follows that the period of 
the sequence mod $p^f$ has length $p^f$, for 1 ≤ f ≤ e, otherwise some residues mod $p^f$ 
would never occur. Clearly, c is not a multiple of p, for otherwise each $X_n$ would 
be a multiple of p. If p ≤ 3, it is easy to establish the necessity of conditions (iii) 
and (iv) by trial and error, so we may assume that p ≥ 5. If $d \not\equiv 0$ (modulo p) then 
$dx^2 + ax + c \equiv d(x + a_1)^2 + c_1$ (modulo $p^e$) for some integers $a_1$ and $c_1$ and for all 
integers x; this quadratic takes the same value at the points x and $-x - 2a_1$, so it 
cannot assume all values modulo $p^e$. Hence $d \equiv 0$ (modulo p); and if $a \not\equiv 1$, we would 
have $dx^2 + ax + c \equiv x$ (modulo p) for some x, contradicting the fact that the sequence 
mod p has period length p. 

To show the sufficiency of the conditions, we may assume by Theorem 3.2.1.2A and 
consideration of some trivial cases that $m = p^e$ where e ≥ 2. If p = 2, we have $X_{n+2} \equiv 
X_n + 2$ (modulo 4), by trial; and if p = 3, we have $X_{n+3} \equiv X_n - d + 3c$ (modulo 9), using 
(i) and (ii). For p ≥ 5, we can prove that $X_{n+p} \equiv X_n + pc$ (modulo $p^2$): Let d = pr, 
a = 1 + ps. Then if $X_n \equiv cn + pY_n$ (modulo $p^2$), we must have $Y_{n+1} \equiv n^2c^2r + ncs + Y_n$ 
(modulo p); hence $Y_n \equiv \binom{n}{3}2c^2r + \binom{n}{2}(c^2r + cs)$ (modulo p). Thus $Y_p \bmod p = 0$, and 
the desired relation has been proved. 

Now we can prove that the sequence ⟨$X_n$⟩ of integers defined in the “hint” satisfies 
the relation 

$$X_{n+p^f} \equiv X_n + tp^f \pmod{p^{f+1}}, \qquad n \ge 0,$$

for some t with t mod p ≠ 0, and for all f ≥ 1. This suffices to prove that the sequence 
⟨$X_n \bmod p^e$⟩ has period length $p^e$, for the length of the period is a divisor of $p^e$ but 
not a divisor of $p^{e-1}$. The relation above has already been established for f = 1, and 
for f > 1 it can be proved by induction in the following manner: Let 

$$X_{n+p^f} \equiv X_n + tp^f + Z_np^{f+1} \pmod{p^{f+2}};$$

then the quadratic law for generating the sequence, with d = pr, a = 1 + ps, yields 
$Z_{n+1} \equiv 2rtnc + st + Z_n$ (modulo p). It follows that $Z_{n+p} \equiv Z_n$ (modulo p); hence 

$$X_{n+kp^f} \equiv X_n + k(tp^f + Z_np^{f+1}) \pmod{p^{f+2}}$$

for k = 1, 2, 3, ...; setting k = p completes the proof. 

Notes: If f(x) is a polynomial of degree higher than 2 and $X_{n+1} = f(X_n)$, 
the analysis is more complicated, although we can use the fact that $f(m + p^k) = 
f(m) + p^kf'(m) + p^{2k}f''(m)/2! + \cdots$ to prove that many polynomial recurrences give 
the maximum period. For example, Coveyou has proved that the period is $m = 2^e$ if 
f(0) is odd, $f'(j) \equiv 1$, $f''(j) \equiv 0$, and $f(j + 1) \equiv f(j) + 1$ (modulo 4) for j = 0, 1, 2, 3. 
[Studies in Applied Math. 3 (Philadelphia: SIAM, 1969), 70-111.] 

9. Let $X_n = 4Y_n + 2$; then the sequence $Y_n$ satisfies the quadratic recurrence $Y_{n+1} = 
(4Y_n^2 + 5Y_n + 1) \bmod 2^{e-2}$. 
10. Case 1: $X_0 = 0$, $X_1 = 1$; hence $X_n \equiv F_n$. We seek the smallest n for which $F_n \equiv 0$ 
and $F_{n+1} \equiv 1$ (modulo $2^e$). Since $F_{2n} = F_n(F_{n-1} + F_{n+1})$, $F_{2n+1} = F_n^2 + F_{n+1}^2$, we find 
by induction on e that, for e > 1, $F_{3 \cdot 2^{e-1}} \equiv 0$ and $F_{3 \cdot 2^{e-1}+1} \equiv 2^e + 1$ (modulo $2^{e+1}$). 
This implies that the period is a divisor of $3 \cdot 2^{e-1}$ but not a divisor of $3 \cdot 2^{e-2}$, so it is 
either $3 \cdot 2^{e-1}$ or $2^{e-1}$. But $F_{2^{e-1}}$ is always odd (since only $F_{3n}$ is even). 

Case 2: $X_0 = a$, $X_1 = b$. Then $X_n \equiv aF_{n-1} + bF_n$; we need to find the smallest 
positive n with $a(F_{n+1} - F_n) + bF_n \equiv a$ and $aF_n + bF_{n+1} \equiv b$. This implies that 
$(b^2 - ab - a^2)F_n \equiv 0$, $(b^2 - ab - a^2)(F_{n+1} - 1) \equiv 0$. And $b^2 - ab - a^2$ is odd (that is, 
prime to m); so the condition is equivalent to $F_n \equiv 0$, $F_{n+1} \equiv 1$. 

Methods to determine the period of ⟨$F_n$⟩ for any modulus appear in an article by 
D. D. Wall, AMM 67 (1960), 525-532. Further facts about the Fibonacci sequence 
mod $2^e$ have been derived by B. Jansson [Random Number Generators (Stockholm: 
Almqvist & Wiksell, 1966), Section 3C1]. 
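
The period lengths derived in answer 10 can be observed directly for small e. The following Python sketch is illustrative; the function name and the sample starting pairs are assumptions, and the additive step is inverted by subtraction so every starting pair lies on a pure cycle.

```python
def fib_period(m, a=0, b=1):
    """Smallest n > 0 with (X_n, X_{n+1}) = (a, b), where
    X_0 = a, X_1 = b, X_{k+1} = (X_k + X_{k-1}) mod m."""
    start = (a % m, b % m)
    x, y, n = start[0], start[1], 0
    while True:
        x, y = y, (x + y) % m
        n += 1
        if (x, y) == start:
            return n

periods = {e: fib_period(2**e) for e in range(1, 12)}
```

Case 2 says that any starting pair with $b^2 - ab - a^2$ odd, such as (1, 3) or (2, 1), should give the same period as the Fibonacci sequence itself.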

11. (a) We have z* = 1 + f(z)u(z) + p*v(z) for some u(z) and v(z), where v(z) # 0 
(modulo f(z) and p). By the binomial theorem, 


2? = 1+ pt u(z) + pe v(z)?(p — 1)/2 


plus further terms congruent to zero (modulo f(z) and p®*t?). Since p° > 2, we have 
zòP = 14 p®*tv(z) (modulo f(z) and p*t?). If p**tu(z) = 0 (modulo f(z) and p**?), 
there must exist polynomials a(z) and b(z) such that p®*!(v(z) + pa(z)) = f(z)b(2). 
Since f(0) = 1, this implies that b(z) is a multiple of p°t’ (by Gauss’s Lemma 4.6.1G); 
hence v(z) = 0 (modulo f(z) and p), a contradiction. 

(b) If $z^\lambda - 1 = f(z)u(z) + p^e v(z)$, then

$$G(z) = u(z)/(z^\lambda - 1) + p^e v(z)/f(z)(z^\lambda - 1);$$

hence $A_{n+\lambda} \equiv A_n$ (modulo $p^e$) for large $n$. Conversely, if $\langle A_n\rangle$ has the latter property
then $G(z) = u(z) + v(z)/(1 - z^\lambda) + p^e H(z)$, for some polynomials $u(z)$ and $v(z)$,
and some power series $H(z)$, all with integer coefficients. This implies the identity
$1 - z^\lambda = u(z)f(z)(1 - z^\lambda) + v(z)f(z) + p^e H(z)f(z)(1 - z^\lambda)$; and $H(z)f(z)(1 - z^\lambda)$ is a
polynomial since the other terms of the equation are polynomials.

(c) It suffices to prove that $\lambda(p^e) \ne \lambda(p^{e+1})$ implies that $\lambda(p^{e+1}) = p\lambda(p^e) \ne
\lambda(p^{e+2})$. Applying (a) and (b), we know that $\lambda(p^{e+2}) \ne p\lambda(p^e)$, and that $\lambda(p^{e+1})$ is
a divisor of $p\lambda(p^e)$ but not of $\lambda(p^e)$. Hence if $\lambda(p^e) = p^f q$, where $q \bmod p \ne 0$, then
$\lambda(p^{e+1})$ must be $p^{f+1}d$, where $d$ is a divisor of $q$. But now $X_{n+p^{f+1}d} \equiv X_n$ (modulo $p^e$);
hence $p^{f+1}d$ is a multiple of $p^f q$, hence $d = q$. [Note: The hypothesis $p^e > 2$ is
necessary; for example, let $a_1 = 4$, $a_2 = -1$, $k = 2$; then $\langle A_n\rangle = 1, 4, 15, 56, 209, 780,
\ldots$; $\lambda(2) = 2$, $\lambda(4) = 4$, $\lambda(8) = 4$.]

(d) $g(z) = X_0 + (X_1 - a_1X_0)z + \cdots + (X_{k-1} - a_1X_{k-2} - a_2X_{k-3} - \cdots - a_{k-1}X_0)z^{k-1}$.

(e) The derivation in (b) can be generalized to the case $G(z) = g(z)/f(z)$; then
the assumption of period length $\lambda$ implies that $g(z)(1 - z^\lambda) \equiv 0$ (modulo $f(z)$ and $p^e$);
we treated only the special case $g(z) = 1$ above. But both sides of this congruence can
be multiplied by Hensel's $b(z)$, and we obtain $1 - z^\lambda \equiv 0$ (modulo $f(z)$ and $p^e$).

Note: A more "elementary" proof of the result in (c) can be given without using
generating functions, using methods analogous to those in the answer to exercise 8: If
$A_{\lambda+n} = A_n + p^e B_n$, for $n = r, r+1, \ldots, r+k-1$ and some integers $B_n$, then this
same relation holds for all $n \ge r$ if we define $B_{r+k}, B_{r+k+1}, \ldots$ by the given recurrence
relation. Since the resulting sequence of $B$'s is some linear combination of shifts of
the sequence of $A$'s, we will have $B_{\lambda+n} \equiv B_n$ (modulo $p^e$) for all large enough values
of $n$. Now $\lambda(p^{e+1})$ must be some multiple of $\lambda = \lambda(p^e)$; for all large enough $n$ we have
$A_{n+j\lambda} = A_n + p^e(B_n + B_{n+\lambda} + B_{n+2\lambda} + \cdots + B_{n+(j-1)\lambda}) \equiv A_n + jp^eB_n$ (modulo $p^{2e}$)
for $j = 1, 2, 3, \ldots$. No $k$ consecutive $B$'s are multiples of $p$; hence $\lambda(p^{e+1}) = p\lambda(p^e) \ne
\lambda(p^{e+2})$ follows immediately when $e \ge 2$. We still must prove that $\lambda(p^{e+2}) \ne p\lambda(p^e)$
when $p$ is odd and $e = 1$; here we let $B_{\lambda+n} = B_n + pC_n$, and observe that $C_{n+\lambda} \equiv C_n$
(modulo $p$) when $n$ is large enough. Then $A_{n+p\lambda} \equiv A_n + p^2\bigl(B_n + \binom{p}{2}C_n\bigr)$ (modulo $p^3$),
and the proof is readily completed.

For the history of this problem, see Morgan Ward, Trans. Amer. Math. Soc. 35 
(1933), 600-628; see also D. W. Robinson, AMM 73 (1966), 619-621. 
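The counterexample in the note to part (c) is easy to reproduce. The sketch below assumes the sequence is $A_n = 4A_{n-1} - A_{n-2}$ with $A_0 = 1$, $A_1 = 4$ (which matches the listed values 1, 4, 15, 56, 209, 780) and finds the period by brute force. The function name and argument layout are illustrative only.

```python
def period_mod(m, a1=4, a2=-1, start=(1, 4)):
    """Period of A_n = a1*A_{n-1} + a2*A_{n-2} modulo m, by direct iteration.

    Assumes a2 is invertible modulo m, so the sequence is purely periodic.
    """
    s0 = (start[0] % m, start[1] % m)
    s, n = s0, 0
    while True:
        s = (s[1], (a1 * s[1] + a2 * s[0]) % m)
        n += 1
        if s == s0:
            return n
```

With the default parameters the periods modulo 2, 4, and 8 come out as 2, 4, and 4, so the period fails to double between 4 and 8 when $p^e = 2$.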


12. The period length mod 2 can be at most 4; and the period length mod $2^{e+1}$ is at
most twice the maximum length mod $2^e$, by the considerations of the previous exercise.
So the maximum conceivable period length is $2^{e+1}$; this is achievable, for example, in
the trivial case $a = 0$, $b = c = 1$.


13,14. Clearly $Z_{n+\lambda} = Z_n$, so $\lambda'$ is certainly a divisor of $\lambda$. Let the least common
multiple of $\lambda'$ and $\lambda_1$ be $\lambda'_1$, and define $\lambda'_2$ similarly. We have $X_n + Y_n \equiv Z_n \equiv Z_{n+\lambda'_1} \equiv
X_n + Y_{n+\lambda'_1}$, so $\lambda'_1$ is a multiple of $\lambda_2$. Similarly, $\lambda'_2$ is a multiple of $\lambda_1$. This yields
the desired result. (The result is "best possible" in the sense that sequences for which
$\lambda' = \lambda_0$ can be constructed, as well as sequences for which $\lambda' = \lambda$.)


15. Algorithm M generates $(X_{n+k}, Y_n)$ in step M1 and outputs $Z_n = X_{n+k-q_n}$ in step
M3, for all sufficiently large $n$. Thus $\langle Z_n\rangle$ has a period of length $\lambda'$, where $\lambda'$ is the
least positive integer such that $X_{n+k-q_n} = X_{n+\lambda'+k-q_{n+\lambda'}}$ for all large $n$. Since $\lambda$ is a
multiple of $\lambda_1$ and $\lambda_2$, it follows that $\lambda'$ is a divisor of $\lambda$. (These observations are due
to Alan G. Waterman.)

We also have $n + k - q_n \equiv n + \lambda' + k - q_{n+\lambda'}$ (modulo $\lambda_1$) for all large $n$, by the
distinctness of the $X$'s. The bound on $\langle q_n\rangle$ implies that $q_{n+\lambda'} = q_n + c$ for all large $n$,
where $c \equiv \lambda'$ (modulo $\lambda_1$) and $|c| < \frac12\lambda_1$. But $c$ must be 0 since $\langle q_n\rangle$ is bounded. Hence
$\lambda' \equiv 0$ (modulo $\lambda_1$), and $q_{n+\lambda'} = q_n$ for all large $n$; it follows that $\lambda'$ is a multiple of
$\lambda_2$ and $\lambda_1$, so $\lambda' = \lambda$.

Note: The answer to exercise 3.2.1.2-4 implies that when $\langle Y_n\rangle$ is a linear congruential
sequence of maximum period modulo $m = 2^e$, the period length $\lambda_2$ will be at
most $2^{e-2}$ when $k$ is a power of 2.


16. There are several methods of proof. 
(1) Using the theory of finite fields. In the field with $2^k$ elements let $\xi$ satisfy
$\xi^k = a_1\xi^{k-1} + \cdots + a_k$. Let $f(b_1\xi^{k-1} + \cdots + b_k) = b_k$, where each $b_j$ is either zero
or one; this is a linear function. If word X in the generation algorithm is $(b_1b_2\ldots b_k)_2$
before (10) is executed, and if $b_1\xi^{k-1} + \cdots + b_k\xi^0 = \xi^n$, then word X represents $\xi^{n+1}$ after
(10) is executed. Hence the sequence is $f(\xi^n), f(\xi^{n+1}), f(\xi^{n+2}), \ldots$; and $f(\xi^{n+k}) =
f(\xi^n\xi^k) = f(a_1\xi^{n+k-1} + \cdots + a_k\xi^n) = a_1f(\xi^{n+k-1}) + \cdots + a_kf(\xi^n)$.

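(2) Using brute force, or elementary ingenuity. We are given a sequence $X_{nj}$,
$n \ge 0$, $1 \le j \le k$, satisfying

$$X_{(n+1)j} \equiv X_{n(j+1)} + a_jX_{n1}, \quad 1 \le j < k; \qquad X_{(n+1)k} \equiv a_kX_{n1} \pmod 2.$$

We must show that this implies $X_{nk} \equiv a_1X_{(n-1)k} + \cdots + a_kX_{(n-k)k}$, for $n \ge k$. Indeed,
it implies $X_{nj} \equiv a_1X_{(n-1)j} + \cdots + a_kX_{(n-k)j}$ when $1 \le j \le k \le n$. This is clear for
$j = 1$, since $X_{n1} \equiv a_1X_{(n-1)1} + X_{(n-1)2} \equiv a_1X_{(n-1)1} + a_2X_{(n-2)1} + X_{(n-2)3}$, etc. For
$j > 1$, we have by induction

$$\begin{aligned}
X_{nj} &\equiv X_{(n+1)(j-1)} - a_{j-1}X_{n1} \\
&\equiv \sum_{1\le i\le k} a_iX_{(n+1-i)(j-1)} - a_{j-1}\sum_{1\le i\le k} a_iX_{(n-i)1} \\
&\equiv \sum_{1\le i\le k} a_i\bigl(X_{(n+1-i)(j-1)} - a_{j-1}X_{(n-i)1}\bigr) \\
&\equiv a_1X_{(n-1)j} + \cdots + a_kX_{(n-k)j}.
\end{aligned}$$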


This proof does not depend on the fact that operations were done modulo 2, or modulo 
any prime number. 
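The brute-force argument in (2) can be exercised directly. The sketch below iterates the componentwise update rule and checks that every component obeys the $k$-term linear recurrence once $n \ge k$. The modulus is a parameter because the proof does not rely on it being 2; the function names and the list-based state layout are illustrative assumptions.

```python
def step(state, a, m=2):
    """One update X_n -> X_{n+1}; state[j-1] holds X_{nj} for j = 1..k."""
    k = len(state)
    lead = state[0]
    nxt = [(state[j + 1] + a[j] * lead) % m for j in range(k - 1)]
    nxt.append((a[k - 1] * lead) % m)
    return nxt

def satisfies_recurrence(a, start, steps=200, m=2):
    """Check X_{nj} = a_1 X_{(n-1)j} + ... + a_k X_{(n-k)j} (mod m) for n >= k."""
    k = len(a)
    rows = [list(start)]
    for _ in range(steps):
        rows.append(step(rows[-1], a, m))
    for n in range(k, len(rows)):
        for j in range(k):
            rhs = sum(a[i] * rows[n - 1 - i][j] for i in range(k)) % m
            if rows[n][j] != rhs:
                return False
    return True
```

Only the last component `state[k-1]` corresponds to the bit that the generator actually delivers; the check over all components mirrors the stronger statement proved above.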


17. (a) When the sequence terminates, the $(k-1)$-tuple $(X_{n+1}, \ldots, X_{n+k-1})$ occurs
for the $(m+1)$st time. A given $(k-1)$-tuple $(X_{r+1}, \ldots, X_{r+k-1})$ can have only $m$
distinct predecessors $X_r$, so one of these occurrences must be for $r = 0$. (b) Since
the $(k-1)$-tuple $(0, \ldots, 0)$ occurs $(m+1)$ times, each possible predecessor appears,
so the $k$-tuple $(a_1, 0, \ldots, 0)$ appears for all $a_1$, $0 \le a_1 < m$. Let $1 \le s < k$ and
suppose we have proved that all $k$-tuples $(a_1, \ldots, a_s, 0, \ldots, 0)$ appear in the sequence
when $a_s \ne 0$. By the construction, this $k$-tuple would not be in the sequence unless
$(a_1, \ldots, a_s, 0, \ldots, 0, y)$ had appeared earlier for $1 \le y < m$. Hence the $(k-1)$-tuple
$(a_1, \ldots, a_s, 0, \ldots, 0)$ has appeared $m$ times, and all $m$ possible predecessors appear; this
means that $(a, a_1, \ldots, a_s, 0, \ldots, 0)$ appears for $0 \le a < m$. The proof is now complete
by induction.

The result also follows from Theorem 2.3.4.2D, using the directed graph of exercise
2.3.4.2-23. The arcs from $(x_1, \ldots, x_j, 0, \ldots, 0)$ to $(x_2, \ldots, x_j, 0, 0, \ldots, 0)$, where $x_j \ne 0$
and $1 \le j \le k$, form an oriented subtree related neatly to Dewey decimal notation.


18. By exercise 16, the most significant bit of $U_{n+1}$ is completely determined by the
first and third bits of $U_n$, so only 32 of the 64 possible pairs $(\lfloor 8U_n\rfloor, \lfloor 8U_{n+1}\rfloor)$ occur.
[Notes: If we had used, say, 11-bit numbers $U_n = (.X_{11n}X_{11n+1}\ldots X_{11n+10})_2$, the
sequence would be satisfactory for many applications. If another constant appears in
A having more 1 bits, the generalized spectral test might give some indication of its
suitability. See exercise 3.3.4-24; we could examine $\nu_t$ in dimensions $t = 36, 37, 38, \ldots$.]


20. For $k = 64$ one can use CONTENTS(A) = $(\text{243F6A8885A308D3})_{16}$ (the bits of $\pi$!).

21. [J. London Math. Soc. 21 (1946), 169-172.] Any sequence of period length $m^k - 1$
with no $k$ consecutive zeros leads to a sequence of period length $m^k$ by inserting a zero
in the appropriate place, as in exercise 7; conversely, we can start with a sequence of
period length $m^k$ and delete an appropriate zero from the period, to form a sequence of
the other type. Let us call these "$(m, k)$ sequences" of types A and B. The hypothesis
assures us of the existence of $(p, k)$ sequences of type A, for all primes $p$ and all $k \ge 1$;
hence we have $(p, k)$ sequences of type B for all such $p$ and $k$.

To get a $(p^e, k)$ sequence of type B, let $e = qr$, where $q$ is a power of $p$ and $r$ is not
a multiple of $p$. Start with a $(p, qrk)$ sequence of type A, namely $X_0, X_1, X_2, \ldots$; then
(using the $p$-ary number system) the grouped digits $(X_0\ldots X_{q-1})_p$, $(X_q\ldots X_{2q-1})_p$, $\ldots$
form a $(p^q, rk)$ sequence of type A, since $q$ is relatively prime to $p^{qrk} - 1$ and the
sequence therefore has a period length of $p^{qrk} - 1$. This leads to a $(p^q, rk)$ sequence
$\langle Y_n\rangle$ of type B; and $(Y_0Y_1\ldots Y_{r-1})_{p^q}$, $(Y_rY_{r+1}\ldots Y_{2r-1})_{p^q}$, $\ldots$ is a $(p^{qr}, k)$ sequence of
type B by a similar argument, since $r$ is relatively prime to $p^{qrk}$.

To get an $(m, k)$ sequence of type B for arbitrary $m$, we can combine $(p^e, k)$
sequences for each of the prime power factors of $m$ using the Chinese remainder
theorem; but a simpler method is available. Let $\langle X_n\rangle$ be an $(r, k)$ sequence of type B,
and let $\langle Y_n\rangle$ be an $(s, k)$ sequence of type B, where $r$ and $s$ are relatively prime; then
$\langle (X_n + Y_n) \bmod rs\rangle$ is an $(rs, k)$ sequence of type B, by exercise 13.

A simple, uniform construction that yields (2, k) sequences for arbitrary k has 
been discovered by A. Lempel [IEEE Trans. C-19 (1970), 1204-1209]. 


22. By the Chinese remainder theorem, we can find constants $a_1, \ldots, a_k$ having desired
residues modulo each prime divisor of $m$. If $m = p_1p_2\ldots p_t$, the period length will be
$\operatorname{lcm}(p_1^k - 1, \ldots, p_t^k - 1)$. In fact, we can achieve reasonably long periods for arbitrary $m$
(not necessarily squarefree), as shown in exercise 11.
23. Subtraction may be faster than addition, see exercise 3.2.1.1-5; the period length
is still $2^{e-1}(2^{55} - 1)$, by exercise 30. R. Brent has pointed out that the calculations can
be done exactly on floating point numbers in $[0\,..\,1)$; see exercise 3.6-11.
24. Run the sequence backwards. In other words, if $Z_n = Y_{-n}$ we have $Z_n =
(Z_{n-k+l} - Z_{n-k}) \bmod 2 = (Z_{n-k+l} + Z_{n-k}) \bmod 2$.
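The reversal argument can be tested on a finite stretch of bits. The sketch assumes that the forward sequence is defined by $Y_n = (Y_{n-l} + Y_{n-k}) \bmod 2$ with $0 < l < k$; that is the form implied by the reversed recurrence, but the exercise statement itself is not reproduced here. Function names are illustrative.

```python
def lagged_bits(l, k, seed_bits, length):
    """Bits with Y_n = (Y_{n-l} + Y_{n-k}) mod 2, started from k seed bits."""
    y = list(seed_bits)
    for n in range(k, length):
        y.append(y[n - l] ^ y[n - k])
    return y

def reversed_satisfies(l, k, y):
    """Check Z_n = (Z_{n-k+l} + Z_{n-k}) mod 2 for the reversed list Z."""
    z = y[::-1]
    return all(z[n] == (z[n - k + l] ^ z[n - k]) for n in range(k, len(z)))
```

Reversing a finite window is enough here, since the recurrence only relates entries at fixed offsets from each other.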
25. This idea can save most of the overhead of subroutine calls. For example, suppose 
Program A is invoked by calling JMP RANDM, where we have 

    RANDM  STJ   1F
           LDA   Y,6    \
            ...          }  Program A
           ENT6  55     /
    1H     JMP   *      ▮

The cost per random number is then $14 + \frac{2}{55}$ units of time. But suppose we generate
random numbers by saying 'DEC6 1; J6Z RNGEN; LDA Y,6' instead, with the subroutine

    RNGEN  STJ   1F              ENT6  31
           ENT6  24              LDA   Y,6
           LDA   Y+31,6          ADD   Y+24,6
           ADD   Y,6             STA   Y,6
           STA   Y+31,6          DEC6  1
           DEC6  1               J6P   *-4
           J6P   *-4             ENT6  55
                          1H     JMP   *      ▮

(The left column is executed first, followed by the right column.) The cost is now only
$(12 + \frac{6}{55})u$. [A similar implementation, expressed in the C language,
is used in The Stanford GraphBase (New York: ACM Press, 1994), GB_FLIP.] Indeed,
many applications find it preferable to generate an array of random numbers all at
once. Moreover, the latter approach is essentially mandatory when we enhance the
randomness with Lüscher's method; see the C and FORTRAN routines in Section 3.6.
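The batch-refill idea translates readily to a high-level language. The following sketch mirrors the two loops of the RNGEN subroutine, assuming the additive recurrence $X_n = (X_{n-24} + X_{n-55}) \bmod m$ and the storage convention that `y[k]` holds $X_{55-k}$ for $1 \le k \le 55$. The class name, the modulus default, and the seeding interface are illustrative assumptions, not part of the original programs.

```python
class LaggedFibonacciBatch:
    """Additive generator that refills all 55 values at once (cf. RNGEN)."""

    def __init__(self, seed_values, m=2 ** 30):
        if len(seed_values) != 55:
            raise ValueError("need exactly 55 seed values X_0..X_54")
        self.m = m
        # y[k] holds X_{55-k}; y[0] is unused padding so indices match the text.
        self.y = [0] + list(reversed(seed_values))
        self.j = 1  # forces a refill on the first call

    def _refill(self):
        y, m = self.y, self.m
        for i in range(24, 0, -1):      # first loop of RNGEN
            y[31 + i] = (y[31 + i] + y[i]) % m
        for i in range(31, 0, -1):      # second loop of RNGEN
            y[i] = (y[i] + y[24 + i]) % m

    def next(self):
        """Equivalent of 'DEC6 1; J6Z RNGEN; LDA Y,6'."""
        self.j -= 1
        if self.j == 0:
            self._refill()
            self.j = 55
        return self.y[self.j]
```

Successive calls to `next()` deliver $X_{55}, X_{56}, \ldots$; the per-number work outside the refill is just a decrement, a test, and a load.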
27. Let $J_n = \lfloor kX_n/m\rfloor$. Lemma. After the $(k^2 + 7k - 2)/2$ consecutive values

$$0^{k+2}\ 1\ 0^{k+1}\ 2\ 0^{k}\ \ldots\ (k-1)\ 0^{3}$$

occur in the $\langle J_n\rangle$ sequence, Algorithm B will have $V[j] < m/k$ for $0 \le j < k$, and also
$Y < m/k$. Proof. Let $S_n$ be the set of positions $j$ such that $V[j] < m/k$ just before $X_n$
is generated, and let $j_n$ be the index such that $V[j_n] \leftarrow X_n$. If $j_n \notin S_n$ and $J_n = 0$,
then $S_{n+1} = S_n \cup \{j_n\}$ and $j_{n+1} > 0$; if $j_n \in S_n$ and $J_n = 0$, then $S_{n+1} = S_n$ and
$j_{n+1} = 0$. After $k + 2$ successive 0s, we must therefore have $0 \in S_n$ and $j_{n+1} = 0$. Then
after "$1\ 0^{k+1}$" we must have $\{0, 1\} \subseteq S_n$ and $j_{n+1} = 0$; after "$2\ 0^{k}$" we must have
$\{0, 1, 2\} \subseteq S_n$ and $j_{n+1} = 0$; and so on.

Corollary. Let $l = (k^2 + 7k - 2)/2$. If $\lambda \ge lk^l$, either Algorithm B yields a period
of length $\lambda$ or the sequence $\langle X_n\rangle$ is poorly distributed. Proof. The probability that
any given length-$l$ pattern of $J$'s does not occur in a random sequence of length $\lambda$ is
less than $(1 - k^{-l})^{\lambda/l} < \exp(-k^{-l}\lambda/l) \le e^{-1}$; hence the stated pattern should appear.
After it does, the subsequent behavior of Algorithm B will be the same each time it
reaches this part of the period. (When $k > 4$, we are requiring $\lambda > 10^{21}$, so this result
is purely academic. But smaller bounds may be possible.)


29. The following algorithm performs about k? operations in the worst case, but its 
average running time is much faster, perhaps O(log k) or even O(1): 
X1. Set $(a_0, a_1, \ldots, a_k) \leftarrow (x_1, \ldots, x_k, m - 1)$.
X2. Let $i$ be minimum with $a_i > 0$ and $i > 0$. Do subroutine Y for $j = i + 1$,
$\ldots$, $k$, while $a_k > 0$.
X3. If $a_0 > a_k$, $f(x_1, \ldots, x_k) = a_0$; otherwise if $a_0 > 0$, $f(x_1, \ldots, x_k) = a_0 - 1$;
otherwise $f(x_1, \ldots, x_k) = a_k$. ▮
Y1. Set $l \leftarrow 0$. (The subroutine in steps Y1-Y3 essentially tests the lexicographic
relation $(a_i, \ldots, a_{i+k-1}) \ge (a_j, \ldots, a_{j+k-1})$, decreasing $a_k$ if necessary to
make this inequality true. We assume that $a_{k+1} = a_1$, $a_{k+2} = a_2$, etc.)
Y2. If $a_{i+l} > a_{j+l}$, exit the subroutine. Otherwise if $j + l = k$, set $a_k \leftarrow a_{i+l}$.
Otherwise if $a_{i+l} = a_{j+l}$, go on to step Y3. Otherwise if $j + l > k$, decrease
$a_k$ by 1 and exit. Otherwise set $a_k \leftarrow 0$ and exit.
Y3. Increase $l$ by 1, and return to step Y2 if $l < k$. ▮


This problem was first solved by H. Fredricksen when m = 2 [J. Combinatorial 
Theory 9 (1970), 1-5; A12 (1972), 153-154]; in that special case the algorithm is 
simpler and it can be done with k-bit registers. See also H. Fredricksen and J. Maiorana, 
Discrete Math. 23 (1978), 207-210, who essentially discovered Algorithm 7.2.1.1F. 


30. (a) By exercise 11, it suffices to show that the period length mod 8 is $4(2^k - 1)$; this
will be true if and only if $x^{2(2^k-1)} \not\equiv 1$ (modulo 8 and $f(x)$), if and only if $x^{2^k-1} \not\equiv 1$
(modulo 4 and $f(x)$). Write $f(x) = f_e(x^2) + xf_o(x^2)$, where $f_e(x^2) = \frac12(f(x) + f(-x))$.
Then $f(x)^2 + f(-x)^2 \equiv 2f(x^2)$ (modulo 8) if and only if $f_e(x)^2 + xf_o(x)^2 \equiv f(x)$
(modulo 4); and the latter condition holds if and only if $f_e(x)^2 \equiv -xf_o(x)^2$ (modulo 4
and $f(x)$), because $f_e(x)^2 + xf_o(x)^2 = f(x) + O(x^{k-1})$. Furthermore, working modulo 2
and $f(x)$, we have $f_e(x)^2 \equiv f_e(x^2) \equiv xf_o(x^2) \equiv x^{2^k}f_o(x)^2$, hence $f_e(x) \equiv x^{2^{k-1}}f_o(x)$.
Therefore $f_e(x)^2 \equiv x^{2^k}f_o(x)^2$ (modulo 4 and $f(x)$), and the hint follows. A similar
argument proves that $x^{2^k} \equiv x$ (modulo 4 and $f(x)$) if and only if $f(x)^2 + f(-x)^2 \equiv
2(-1)^kf(-x^2)$ (modulo 8).

(b) The condition can hold only when l is odd and k = 2l. But then f(x) is 
primitive modulo 2 only when k = 2. [Math. Comp. 63 (1994), 389-401.] 
31. We have $X_n = (-1)^{Y_n}3^{Z_n} \bmod 2^e$ for some $Y_n$ and $Z_n$, by Theorem 3.2.1.2C;
hence $Y_n = (Y_{n-24} + Y_{n-55}) \bmod 2$ and $Z_n = (Z_{n-24} + Z_{n-55}) \bmod 2^{e-2}$. Since $Z_k$ is
odd if and only if $X_k \bmod 8 = 3$ or 5, the period length is $2^{e-3}(2^{55} - 1)$ by the previous
exercise.


32. We can ignore the 'mod m' and put it back afterwards. The generating function
$g(z) = \sum_n X_nz^n$ is a polynomial multiple of $1/(1 - z^{24} - z^{55})$; hence $\sum_n X_{2n}z^{2n} =
\frac12(g(z) + g(-z))$ is a polynomial divided by $(1 - z^{24} - z^{55})(1 - z^{24} + z^{55}) = 1 - 2z^{24} +
z^{48} - z^{110}$. The first desired recurrence is therefore $X_{2n} = (2X_{2(n-12)} - X_{2(n-24)} +
X_{2(n-55)}) \bmod m$. Similarly, $\sum_n X_{3n}z^{3n} = \frac13(g(z) + g(\omega z) + g(\omega^2z))$ where $\omega = e^{2\pi i/3}$,
and we find $X_{3n} = (3X_{3(n-8)} - 3X_{3(n-16)} + X_{3(n-24)} + X_{3(n-55)}) \bmod m$.
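Both subsequence recurrences can be confirmed numerically. The sketch below generates $X_n = (X_{n-24} + X_{n-55}) \bmod m$ from arbitrary seed values and checks the two formulas wherever all indices are in range. The function name, the default modulus, and the seeding rule are illustrative choices.

```python
def check_subsequence_recurrences(m=2 ** 16, length=600):
    """Return (ok2, ok3): whether the X_{2n} and X_{3n} recurrences hold."""
    x = [(i * i * 7919 + 13 * i + 1) % m for i in range(55)]  # arbitrary seeds
    for n in range(55, length):
        x.append((x[n - 24] + x[n - 55]) % m)
    ok2 = all(
        x[2 * n] == (2 * x[2 * (n - 12)] - x[2 * (n - 24)] + x[2 * (n - 55)]) % m
        for n in range(55, length // 2)
    )
    ok3 = all(
        x[3 * n] == (3 * x[3 * (n - 8)] - 3 * x[3 * (n - 16)]
                     + x[3 * (n - 24)] + x[3 * (n - 55)]) % m
        for n in range(55, length // 3)
    )
    return ok2, ok3
```

The second recurrence comes from $(1 - z^{24})^3 - z^{165} = 1 - 3z^{24} + 3z^{48} - z^{72} - z^{165}$, using $\omega^{24} = 1$ and $\omega^{55} = \omega$.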


33. (a) $g_{n+t}(z) \equiv z^tg_n(z)$ (modulo $m$ and $1 + z^{31} - z^{55}$), by induction on $t$. (b) Since
$z^{500} \bmod (1 + z^{31} - z^{55}) = 792z^2 + z^5 + 17z^6 + 715z^9 + 36z^{12} + z^{13} + 364z^{16} + 210z^{19} +
105z^{23} + 462z^{26} + 16z^{30} + 1287z^{33} + 9z^{36} + 18z^{37} + 1001z^{40} + 120z^{43} + z^{44} + 455z^{47} + 462z^{50} +
120z^{54}$ (see Algorithm 4.6.1D), we have $X_{500} = (792X_2 + X_5 + \cdots + 120X_{54}) \bmod m$.

[It is interesting to compare the similar formula $X_{165} = (X_0 + 3X_7 + X_{14} +
3X_{31} + 4X_{38} + X_{45}) \bmod m$ to the sparser recurrence for $\langle X_{3n}\rangle$ in the previous exercise.
Lüscher's method of generating 165 numbers and using only the first 55 is clearly
superior to the idea of generating 165 and using only $X_3, X_6, \ldots, X_{165}$.]
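The remainder polynomial in part (b) can be recomputed by repeatedly multiplying by $z$ and applying $z^{55} = z^{31} + 1$. The following sketch does exactly that with integer coefficients; the function name and parameter defaults are illustrative.

```python
def power_of_z_mod(n, k=55, l=31):
    """Coefficients of z^n mod (z^k - z^l - 1), lowest degree first."""
    c = [0] * k
    c[0] = 1                       # start from z^0
    for _ in range(n):
        top = c[-1]                # coefficient that would become z^k
        c = [0] + c[:-1]           # multiply by z
        c[0] += top                # z^k = z^l + 1
        c[l] += top
    return c
```

For instance, `power_of_z_mod(165)` has nonzero coefficients only at exponents 0, 7, 14, 31, 38, and 45, matching the formula for $X_{165}$ quoted above.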


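34. Let $q_0 = 0$, $q_1 = 1$, $q_{n+1} = cq_n + aq_{n-1}$. Then we have

$$\begin{pmatrix} 0 & 1 \\ a & c \end{pmatrix}^{\!n} = \begin{pmatrix} aq_{n-1} & q_n \\ aq_n & q_{n+1} \end{pmatrix},$$

$X_n = (q_{n+1}X_0 + aq_n)/(q_nX_0 + aq_{n-1})$, and $x^n \bmod f(x) = q_nx + aq_{n-1}$, for $n \ge 1$.
Thus if $X_0 = 0$ we have $X_n = 0$ if and only if $x^n \bmod f(x)$ is a nonzero constant.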
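The identity for $x^n \bmod f(x)$ is simple to verify by machine. The sketch assumes $f(x) = x^2 - cx - a$, which is the polynomial consistent with the recurrence for $q_n$; the exercise statement is not reproduced here, so treat that as an assumption. Function names are illustrative.

```python
def q_sequence(a, c, p, n_max):
    """q_0..q_{n_max} with q_{n+1} = c*q_n + a*q_{n-1} (mod p)."""
    q = [0, 1]
    for n in range(1, n_max):
        q.append((c * q[n] + a * q[n - 1]) % p)
    return q

def x_power_mod_f(n, a, c, p):
    """Return (u, v) with x^n = u*x + v modulo p and f(x) = x^2 - c*x - a."""
    u, v = 0, 1                        # x^0 = 1
    for _ in range(n):
        # (u*x + v)*x = u*x^2 + v*x = u*(c*x + a) + v*x
        u, v = (u * c + v) % p, (u * a) % p
    return u, v
```

Comparing `x_power_mod_f(n, a, c, p)` with `(q[n], a*q[n-1] % p)` for a range of `n` exercises the claimed formula.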


35. Conditions (i) and (ii) imply that $f(x)$ is irreducible. For if $f(x) = (x - r_1)(x - r_2)$
and $r_1r_2 \ne 0$ we have $x^{p-1} \equiv 1$ if $r_1 \ne r_2$ and $x^p \equiv r_1$ if $r_1 = r_2$.

Let $\xi$ be a primitive root of a field with $p^2$ elements, and suppose $\xi^{2k} = c_k\xi^k + a_k$.
The quadratic polynomials we seek are precisely the polynomials $f_k(x) = x^2 - c_kx - a_k$
where $1 \le k < p^2 - 1$ and $k \perp p + 1$. (See exercise 4.6.2-16.) Each polynomial occurs
for two values of $k$; hence the number of solutions is $\frac12(p^2 - 1)\prod_{q \backslash (p+1),\ q\ \text{prime}}(1 - 1/q)$.


36. In this case $X_n$ is always odd, so $X_n^{-1}$ exists mod $2^e$. The sequence $\langle q_n\rangle$ defined in
answer 34 is 0, 1, 2, 1, 0, 1, 2, 1, ... modulo 4. We also have $q_{2n} = q_n(q_{n+1} + aq_{n-1})$
and $q_{2n-1} = aq_{n-1}^2 + q_n^2$; hence $q_{2n+1} - aq_{2n-1} = (q_{n+1} - aq_{n-1})(q_{n+1} + aq_{n-1})$. Since
$q_{n+1} + aq_{n-1} \equiv 2$ (modulo 4) when $n$ is even, we deduce that $q_{2^e}$ is an odd multiple
of $2^e$ and $q_{2^e+1} - aq_{2^e-1}$ is an odd multiple of $2^{e+1}$, for all $e \ge 0$. Therefore

$$q_{2^e} + aq_{2^e-1} \equiv q_{2^e+1} + aq_{2^e} + 2^{e+1} \pmod{2^{e+2}}.$$

And $X_{2^{e-2}} \equiv (q_{2^{e-2}+1} + aq_{2^{e-2}})/(q_{2^{e-2}} + aq_{2^{e-2}-1}) \not\equiv 1$ (modulo $2^e$), while $X_{2^{e-1}} \equiv 1$.
Conversely, we need $a \bmod 4 = 1$ and $c \bmod 4 = 2$; otherwise $X_{2n} \equiv 1$ (modulo 8).
[Eichenauer, Lehn, and Topuzoğlu, Math. Comp. 51 (1988), 757-759.] The low-order
bits of this sequence have a short period, so inversive generators with prime modulus
are preferable.
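The period claim can be checked by brute force for small $e$. The sketch assumes the inversive recurrence has the form $X_{n+1} = (aX_n^{-1} + c) \bmod 2^e$ with $X_0 = 1$, which is the form consistent with the fractions in answer 34 but is not restated in this answer. It relies on Python 3.8 or later for modular inverses via `pow(x, -1, m)`; the function name is illustrative.

```python
def inversive_period(a, c, e, x0=1):
    """Period of X_{n+1} = (a * X_n^{-1} + c) mod 2^e starting from an odd x0."""
    m = 2 ** e
    x, n = x0, 0
    while True:
        x = (a * pow(x, -1, m) + c) % m   # x stays odd when a is odd and c is even
        n += 1
        if x == x0:
            return n
```

With $a \bmod 4 = 1$ and $c \bmod 4 = 2$, the returned period should be $2^{e-1}$, as derived above.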


37. We can assume that $b_1 = 0$. By exercise 34, a typical vector in $V$ is

$$\bigl(x,\ (s'_2x + as_2)/(s_2x + as''_2),\ \ldots,\ (s'_dx + as_d)/(s_dx + as''_d)\bigr),$$

where $s_j = q_{b_j}$, $s'_j = q_{b_j+1}$, $s''_j = q_{b_j-1}$. This vector belongs to the hyperplane $H$ if
and only if

$$r_1x + \frac{r_2t_2}{x + u_2} + \cdots + \frac{r_dt_d}{x + u_d} \equiv r_0 - r_2s'_2s_2^{-1} - \cdots - r_ds'_ds_d^{-1} \pmod p,$$

where $t_j = a - as'_js''_js_j^{-2} = -(-a)^{b_j}s_j^{-2}$ and $u_j = as''_js_j^{-1}$. But this relation is equivalent
to a polynomial congruence of degree $\le d$; so it cannot hold for $d + 1$ values of $x$
unless it holds for all $x$, including the distinct points $x = -u_2$, $\ldots$, $x = -u_d$. Hence
$r_2 = \cdots = r_d = 0$, and $r_1 = 0$. [See J. Eichenauer-Herrmann, Math. Comp. 56 (1991),
297-301.]

Notes: If we consider the $(p + 1 - d) \times (d + 1)$ matrix $M$ with rows $\{(1, v_1, \ldots, v_d) \mid
(v_1, \ldots, v_d) \in V\}$, this exercise is equivalent to the assertion that any $d + 1$ rows of $M$
are linearly independent modulo $p$. It is interesting to plot the points $(X_n, X_{n+1})$ for
$p \approx 1000$ and $0 \le n \le p$; traces of circles, rather than straight lines, meet the eye.


SECTION 3.3.1 


1. There are $k = 11$ categories, so the line $\nu = 10$ should be used.


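2. $\frac{2}{49}$, $\frac{3}{49}$, $\frac{4}{49}$, $\frac{5}{49}$, $\frac{6}{49}$, $\frac{9}{49}$, $\frac{6}{49}$, $\frac{5}{49}$, $\frac{4}{49}$, $\frac{3}{49}$, $\frac{2}{49}$.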


3. V = 738, only very slightly higher than that obtained from the good dice! 
There are two reasons why we do not detect the weighting: (a) The new probabilities 
(see exercise 2) are not really very far from the old ones in Eq. (1). The sum of the two 
dice tends to smooth out the probabilities; if we counted instead each of the 36 possible 
pairs of values, we would probably detect the difference quite rapidly (assuming that 
the two dice are distinguishable). (b) A far more important reason is that n is too 
small for a significant difference to be detected. If the same experiment is done for 


large enough n, the faulty dice will be discovered (see exercise 12). 


4. $p_s = \frac{1}{12}$ for $2 \le s \le 12$ and $s \ne 7$; $p_7 = \frac16$. The value of $V$ is $16\frac12$, which falls
between the 75% and 95% entries in Table 1; so it is reasonable, in spite of the fact
that not too many sevens actually turned up.


5. $K_{20}^+ = 1.15$; $K_{20}^- = 0.215$; these values do not differ significantly from random
behavior (being at about the 94% and 86% levels), but they are mighty close. (The 
data values in this exercise come from Appendix A, Table 1.) 


6. The probability that $X_j \le x$ is $F(x)$, so we have the binomial distribution discussed
in Section 1.2.10: $F_n(x) = s/n$ with probability $\binom{n}{s}F(x)^s(1 - F(x))^{n-s}$; the mean
is $F(x)$; the standard deviation is $\sqrt{F(x)(1 - F(x))/n}$. [See Eq. 1.2.10-(19). This
suggests that a slightly better statistic would be to define

$$K_n^+ = \sqrt{n}\max_{-\infty<x<\infty}\bigl(F_n(x) - F(x)\bigr)\big/\sqrt{F(x)(1 - F(x))};$$

see exercise 22. We can calculate the mean and standard deviation of $F_n(y) - F_n(x)$,
for $x < y$, and obtain the covariance of $F_n(x)$ and $F_n(y)$. Using these facts, it can be
shown that for large values of $n$ the function $F_n(x)$ behaves as a "Brownian motion,"
and techniques from this branch of probability theory may be used to study it. The
situation is exploited in articles by J. L. Doob and M. D. Donsker, Annals Math. Stat. 
20 (1949), 393-403 and 23 (1952), 277-281; their approach is generally regarded as 
the most enlightening way to study the KS tests.] 


7. Set $j = n$ in Eq. (13) to see that $K_n^+$ is never negative, and that it can get as high
as $\sqrt{n}$. Similarly, set $j = 1$ to make the same observations about $K_n^-$.


8. The new KS statistic was computed for 20 observations. The distribution of $K_{10}^+$
was used as $F(x)$ when the KS statistic was computed.


9. The idea is erroneous, because all of the observations must be independent. There 
is a relation between the statistics $K_n^+$ and $K_n^-$ on the same data, so each test should be
judged separately. (A high value of one tends to give a low value of the other.) Similarly,
the entries in Figs. 2 and 5, which show 15 tests for each generator, do not show 15 
independent observations, because the maximum-of-5 test is not independent of the 
maximum-of-4 test. The three tests of each horizontal row are independent (because 
they were done on different parts of the sequence), but the five tests in a column are 
somewhat correlated. The net effect of this is that the 95-percent probability levels, 
etc., which apply to one test, cannot legitimately be applied to a whole group of tests 
on the same data. Moral: When testing a random number generator, we may expect 
it to “pass” each of several tests, like the frequency test, maximum test, and run test; 
but an array of data from several different tests should not be considered as a unit 
since the tests themselves may not be independent. The $K_n^+$ and $K_n^-$ statistics should
be considered as two separate tests; a good source of random numbers will pass both. 


10. Each $Y_s$ is doubled, and $np_s$ is doubled, so the numerators of (6) are quadrupled
while the denominators only double. Hence the new value of $V$ is twice as high as the
old one.
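The doubling effect is easy to see with exact arithmetic. The sketch below computes $V = \sum_s (Y_s - np_s)^2/(np_s)$ with rational numbers. The function name is illustrative, and any counts passed to it in an experiment are hypothetical inputs rather than data from the text.

```python
from fractions import Fraction

def chi_square(observed, probs):
    """V = sum over categories of (Y_s - n*p_s)^2 / (n*p_s), computed exactly."""
    n = sum(observed)
    return sum((Fraction(y) - n * p) ** 2 / (n * p) for y, p in zip(observed, probs))
```

For example, with probabilities 1/4, 1/2, 1/4 and hypothetical counts 3, 5, 4 the statistic is 1/2; doubling every count to 6, 10, 8 gives exactly 1.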


11. The empirical distribution function stays the same; the values of $K_n^+$ and $K_n^-$ are
multiplied by $\sqrt2$.


12. Let $Z_s = (Y_s - nq_s)/\sqrt{nq_s}$. The value of $V$ is $n$ times

$$\sum_{s=1}^{k}\bigl(q_s - p_s + \sqrt{q_s/n}\,Z_s\bigr)^2\big/p_s,$$

and the latter quantity stays bounded away from zero as $n$ increases (since $Z_sn^{-1/4}$
is bounded with probability 1). Hence the value of $V$ will increase to a value that is
extremely improbable under the $p_s$ assumption.

For the KS test, let $F(x)$ be the assumed distribution, $G(x)$ the actual distribution,
and let $h = \max|G(x) - F(x)|$. Take $n$ large enough so that $|F_n(x) - G(x)| > h/2$
occurs with very small probability; then $|F_n(x) - F(x)|$ will be improbably high under
the assumed distribution $F(x)$.


13. (The “max” notation should really be replaced by “sup” since a least upper bound 
is meant; however, “max” was used in the text to avoid confusing too many readers by 
the less familiar "sup" notation.) For convenience, let $X_0 = -\infty$, $X_{n+1} = +\infty$. When
$X_j \le x < X_{j+1}$, we have $F_n(x) = j/n$; therefore $\max(F_n(x) - F(x)) = j/n - F(X_j)$
and $\max(F(x) - F_n(x)) = F(X_{j+1}) - j/n$ in this interval. As $j$ varies from 0 to $n$, all
real values of $x$ are considered; this proves that

$$K_n^+ = \sqrt{n}\max_{0\le j\le n}\Bigl(\frac{j}{n} - F(X_j)\Bigr);$$

$$K_n^- = \sqrt{n}\max_{1\le j\le n+1}\Bigl(F(X_j) - \frac{j-1}{n}\Bigr).$$

These equalities are equivalent to (13), since the extra term under the maximum signs
is nonpositive and it must be redundant by exercise 7.
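In practice the two statistics are computed from the sorted observations, dropping the redundant extra terms. The following sketch assumes that `F` is a continuous distribution function supplied by the caller; the function name is illustrative.

```python
import math

def ks_statistics(xs, F):
    """Return (K_n^+, K_n^-) for observations xs under distribution function F."""
    ys = sorted(F(x) for x in xs)          # F(X_1) <= ... <= F(X_n)
    n = len(ys)
    k_plus = math.sqrt(n) * max((j + 1) / n - y for j, y in enumerate(ys))
    k_minus = math.sqrt(n) * max(y - j / n for j, y in enumerate(ys))
    return k_plus, k_minus
```

Here `enumerate` counts from 0, so `(j + 1) / n` and `j / n` correspond to $j/n$ and $(j-1)/n$ in the formulas above.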


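14. The logarithm of the left-hand side simplifies to

$$-\sum_{s=1}^{k}Y_s\ln\Bigl(1 + \frac{Z_s}{\sqrt{np_s}}\Bigr) + \frac{1-k}{2}\ln(2\pi n) - \frac12\sum_{s=1}^{k}\ln p_s - \frac12\sum_{s=1}^{k}\ln\Bigl(1 + \frac{Z_s}{\sqrt{np_s}}\Bigr) + O\Bigl(\frac1n\Bigr),$$

and this quantity simplifies further (upon expanding $\ln(1 + Z_s/\sqrt{np_s})$ and realizing
that $\sum_{s=1}^{k}Z_s\sqrt{np_s} = 0$) to

$$-\frac12\sum_{s=1}^{k}Z_s^2 + \frac{1-k}{2}\ln(2\pi n) - \frac12\ln(p_1\ldots p_k) + O\Bigl(\frac{1}{\sqrt n}\Bigr).$$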


15. The corresponding Jacobian determinant is easily evaluated by (i) removing the
factor $r^{n-1}$ from the determinant, (ii) expanding the resulting determinant by the
cofactors of the row containing "$\cos\theta_1\ \ {-\sin\theta_1}\ \ 0 \ldots 0$" (each of the cofactor determinants
may be evaluated by induction), and (iii) recalling that $\sin^2\theta_1 + \cos^2\theta_1 = 1$.


zV2r+y ue 2 1 zv 2r ue 
16. —-—4+---)du = ye +0 -£ ++) du. 
Í exp( on + ) UW ye (=) f exp On + u 


The latter integral is 


zV2n 2/9 1 zV2n 2/9 2 1 
e z du ee, = T d O . 
J * 3a? i we (Zz) 


When all is put together, the final result is 


y(x +1, 27+ 2V2x +y) = ae Par me e7 (y 2 222) | o(-) 
I'(a+1) ~ an | fone as . , 


If we set 2/2 = xp and write 


zV/2 
- ae 5; (5 5)/ (5) 
= du =p, ea v uj TS 
zJ e u=p_ a ool ee z) =P 


where $t/2 = x + z\sqrt{2x} + y$, we can solve for $y$ to obtain $y = \frac23(1 + z^2) + O(1/\sqrt{x})$,
which is consistent with the analysis above. The solution is therefore $t = \nu + 2\sqrt{\nu}\,z +
\frac43z^2 - \frac23 + O(1/\sqrt{\nu})$.

17. (a) Change of variable, x; + x; + t. 5 
(b) Induction on n; by definition, Pno(a — t) = Pinion — t) dan. 
(c) The left-hand side is r 


xz+t Tk+2 k Ek z2 
f dEn ... f d£k+1 times f dx, dxp_-1... f dx,. 
n k+1 t t t 


E (r-t) (x +t- r)! 


(d) From (b) and (c) we have P,;,(x) = 
The numerator in (24) is Paje; (n). 


r=0 


18. We may assume that $F(x) = x$ for $0 \le x \le 1$, as remarked in the text's derivation
of (24). If $0 \le X_1 \le \cdots \le X_n \le 1$, let $Z_j = 1 - X_{n+1-j}$. We have $0 \le Z_1 \le \cdots \le
Z_n \le 1$; and $K_n^+$ evaluated for $X_1, \ldots, X_n$ equals $K_n^-$ evaluated for $Z_1, \ldots, Z_n$. This
symmetrical relation gives a one-to-one correspondence between sets of equal volume
for which $K_n^+$ and $K_n^-$ fall in a given range.


20. For example, the term O(1/n) is —(4s*—2s7)/n+O(n ~3/2)_ A complete expansion 


has been obtained by H. A. Lauwerier, Zeitschrift für Wahrscheinlichkeitstheorie und 
verwandte Gebiete 2 (1963), 61-68. 
23. Let $m$ be any number $\ge n$. (a) If $\lfloor mF(X_i)\rfloor = \lfloor mF(X_j)\rfloor$ and $i > j$, then
$i/n - F(X_i) > j/n - F(X_j)$. (b) Start with $a_k = 1.0$, $b_k = 0.0$, and $c_k = 0$ for
$0 \le k < m$. Then do the following for each observation $X_j$: Set $Y \leftarrow F(X_j)$, $k \leftarrow
\lfloor mY\rfloor$, $a_k \leftarrow \min(a_k, Y)$, $b_k \leftarrow \max(b_k, Y)$, $c_k \leftarrow c_k + 1$. (Assume that $F(X_j) < 1$ so
that $k < m$.) Then set $j \leftarrow 0$, $r^+ \leftarrow r^- \leftarrow 0$, and for $k = 0, 1, \ldots, m - 1$ (in this
order) do the following whenever $c_k > 0$: Set $r^- \leftarrow \max(r^-, a_k - j/n)$, $j \leftarrow j + c_k$,
$r^+ \leftarrow \max(r^+, j/n - b_k)$. Finally set $K_n^+ \leftarrow \sqrt{n}\,r^+$, $K_n^- \leftarrow \sqrt{n}\,r^-$. The time required
is $O(m + n)$, and the precise value of $n$ need not be known in advance. (If the estimate
$(k + \frac12)/m$ is used for $a_k$ and $b_k$, so that only the values $c_k$ are actually computed for
each $k$, we obtain estimates of $K_n^+$ and $K_n^-$ good to within $\frac12\sqrt{n}/m$, even when $m < n$.)
[ACM Trans. Math. Software 3 (1977), 60—64.] 
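The bucket method of part (b) can be sketched as follows and compared against the usual sort-based computation. Both function names are illustrative; `F` is assumed to return values in $[0\,..\,1)$, and exact agreement is expected only when $m \ge n$.

```python
import math

def ks_by_sorting(xs, F):
    """Reference computation of (K_n^+, K_n^-) using a sort."""
    ys = sorted(F(x) for x in xs)
    n = len(ys)
    return (math.sqrt(n) * max((j + 1) / n - y for j, y in enumerate(ys)),
            math.sqrt(n) * max(y - j / n for j, y in enumerate(ys)))

def ks_by_buckets(xs, F, m):
    """Sort-free computation following steps (b) of the answer; needs m >= n."""
    a = [1.0] * m          # smallest F-value seen in each bucket
    b = [0.0] * m          # largest F-value seen in each bucket
    c = [0] * m            # number of observations in each bucket
    n = 0
    for x in xs:
        y = F(x)
        k = int(m * y)
        a[k] = min(a[k], y)
        b[k] = max(b[k], y)
        c[k] += 1
        n += 1
    j = 0
    r_plus = r_minus = 0.0
    for k in range(m):
        if c[k] > 0:
            r_minus = max(r_minus, a[k] - j / n)
            j += c[k]
            r_plus = max(r_plus, j / n - b[k])
    return math.sqrt(n) * r_plus, math.sqrt(n) * r_minus
```

Part (a) is what justifies looking only at the first and last value in each bucket: within one bucket the $F$-values differ by less than $1/m \le 1/n$.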


25. (a) Since $c_{ij} = \mathrm{E}\bigl(\sum_k a_{ik}X_k\sum_l a_{jl}X_l\bigr) = \sum_k a_{ik}a_{jk}$, we have $C = AA^T$.
(b) Consider the singular value decomposition $A = UDV^T$, where $U$ and $V$ are
orthogonal of sizes $m \times m$ and $n \times n$, and $D$ is $m \times n$ with entries $d_{ij} = [i = j]\sigma_j$;
the singular values $\sigma_j$ are all positive. [See, for example, Golub and Van Loan, Matrix
Computations (1996), §2.5.3.] If $C\bar{C}C = C$ we have $SBS = S$, where $S = DD^T$
and $B = U^T\bar{C}U$. Thus $s_{ij} = [i = j]\sigma_j^2$, where we let $\sigma_{n+1} = \cdots = \sigma_m = 0$, and
$s_{ij} = \sum_{k,l}s_{ik}b_{kl}s_{lj} = \sigma_i^2\sigma_j^2b_{ij}$. Consequently $b_{ij} = [i = j]/\sigma_j^2$ if $i, j \le n$, and we deduce
that $D^TBD$ is the $n \times n$ identity matrix. Let $Y = (Y_1 - \mu_1, \ldots, Y_m - \mu_m)^T$ and $X =
(X_1, \ldots, X_n)^T$; it follows that $W = Y^T\bar{C}Y = X^TA^T\bar{C}AX = X^TVD^TBDV^TX = X^TX$.


SECTION 3.3.2 


1. The observations for a chi-square test must be independent. In the second se- 
quence, successive observations are manifestly dependent, since the second component 
of one equals the first component of the next. 


2. Form $t$-tuples $(Y_{jt}, \ldots, Y_{jt+t-1})$, for $0 \le j < n$, and count how many of them are
equal to each possible value. Apply the chi-square test with $k = d^t$ and with probability
$1/d^t$ in each category. The number of observations, $n$, should be at least $5d^t$.


3. The probability that exactly $j$ values are examined, namely the probability that
$U_{j-1}$ is the $n$th element that lies in the range $\alpha \le U_{j-1} < \beta$, is easily seen to be

$$\binom{j-1}{n-1}p^n(1 - p)^{j-n},$$

by enumeration of the possible places in which the other $n - 1$ occurrences can appear
and by evaluation of the probability of such a pattern. The generating function is
$G(z) = (pz/(1 - (1 - p)z))^n$, which makes sense since the given distribution is the
$n$-fold convolution of the same thing for $n = 1$. Hence the mean and variance are
proportional to $n$; the number of $U$'s to be examined is now easily found to have the
characteristics (min $n$, ave $n/p$, max $\infty$, dev $\sqrt{n(1 - p)}/p$). A more detailed discussion
of this probability distribution when n = 1 may be found in the answer to exercise 
3.4.1-17; see also the considerably more general results of exercise 2.3.4.2—26. 


4. The probability of a gap of length $\ge r$ is the probability that $r$ consecutive $U$'s lie
outside the given range, namely $(1 - p)^r$. The probability of a gap of length exactly $r$
is the probability for length $\ge r$ minus the probability for length $\ge (r + 1)$.


5. As N goes to infinity, so does n (with probability 1), hence this test is just the 
same as the gap test described in the text except for the length of the very last gap. 
And the text’s gap test certainly is asymptotic to the chi-square distribution stated, 
since the length of each gap is independent of the length of the others. [Notes: A quite
complicated proof of this result by E. Bofinger and V. J. Bofinger appears in Annals
Math. Stat. 32 (1961), 524-534. Their paper is noteworthy because it discusses several
interesting variations of the gap test; they show, for example, that the quantity

$$\sum_{0\le r\le t}\frac{(Y_r - (Np)p_r)^2}{(Np)p_r}$$

does not approach a chi-square distribution, although others had suggested this statistic
as a "stronger" test because $Np$ is the expected value of $n$.]

7. 5, 3, 5, 6, 5, 5, 4. 

8. See exercise 10, with w = d. 

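9. (Change $d$ to $w$ in steps C1 and C4.) We have

$$p_r = \frac{d(d-1)\ldots(d-w+1)}{d^r}\left\{{r-1 \atop w-1}\right\}, \quad\text{for } w \le r < t;$$

$$p_t = 1 - \frac{d!}{d^{t-1}}\left(\frac{1}{0!}\left\{{t-1 \atop d}\right\} + \cdots + \frac{1}{(d-w)!}\left\{{t-1 \atop w}\right\}\right).$$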
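These probabilities can be tabulated exactly with Stirling numbers of the second kind. The sketch below is an illustration with hypothetical function names; it assumes $1 \le w \le d$ and $t \ge w$, and it checks nothing beyond what the formulas state.

```python
from fractions import Fraction
from math import factorial

def stirling2(n, k):
    """Stirling number of the second kind, by the standard recurrence."""
    s = [[0] * (k + 1) for _ in range(n + 1)]
    s[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            s[i][j] = j * s[i - 1][j] + s[i - 1][j - 1]
    return s[n][k]

def coupon_probabilities(d, w, t):
    """Return {r: p_r} for w <= r <= t, where p_t covers all lengths >= t."""
    falling = Fraction(factorial(d), factorial(d - w))   # d(d-1)...(d-w+1)
    p = {r: falling * stirling2(r - 1, w - 1) / d ** r for r in range(w, t)}
    tail = sum(Fraction(stirling2(t - 1, j), factorial(d - j))
               for j in range(w, d + 1))
    p[t] = 1 - Fraction(factorial(d), d ** (t - 1)) * tail
    return p
```

A useful sanity check is that the returned probabilities sum to exactly 1 for any admissible $d$, $w$, and $t$.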


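10. As in exercise 3, we really need consider only the case $n = 1$. The generating
function for the probability that a coupon set has length $r$ is

$$G(z) = \frac{d!}{(d-w)!}\sum_{r>0}\left\{{r-1 \atop w-1}\right\}\Bigl(\frac{z}{d}\Bigr)^r = z\Bigl(\frac{(d-1)z}{d-z}\Bigr)\cdots\Bigl(\frac{(d-w+1)z}{d-(w-1)z}\Bigr)$$

by the previous exercise and Eq. 1.2.9-(28). The mean and variance are readily
computed using Theorem 1.2.10A and exercise 3.4.1-17. We find that

$$\operatorname{mean}(G) = w + \Bigl(\frac{d}{d-1} - 1\Bigr) + \cdots + \Bigl(\frac{d}{d-w+1} - 1\Bigr) = d(H_d - H_{d-w}) = \mu;$$

$$\operatorname{var}(G) = d^2\bigl(H_d^{(2)} - H_{d-w}^{(2)}\bigr) - d(H_d - H_{d-w}) = \sigma^2.$$

The number of $U$'s examined, as the search for a coupon set is repeated $n$ times,
therefore has the characteristics (min $wn$, ave $\mu n$, max $\infty$, dev $\sigma\sqrt{n}$).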

11. | 1 | 2 | 9 8 5 3 | 6 | 7 0 | 4 |.
12. Algorithm R (Data for run test). 


R1. [Initialize.] Set $j \leftarrow -1$, and set COUNT[1] ← COUNT[2] ← ⋯ ← COUNT[6] ← 0.
Also set $U_n \leftarrow U_{n-1}$, for convenience in terminating the algorithm.

R2. [Set $r$ zero.] Set $r \leftarrow 0$.

R3. [Is $U_j < U_{j+1}$?] Increase $r$ and $j$ by 1. If $U_j < U_{j+1}$, repeat this step.

R4. [Record the length.] If $r \ge 6$, increase COUNT[6] by one, otherwise increase
COUNT[$r$] by one.

R5. [Done?] If $j < n - 1$, return to step R2. ▮
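A direct transcription of Algorithm R into Python might look like the following sketch. The function name and the `cap` parameter (6 in the algorithm) are illustrative; the input is assumed to be a nonempty list.

```python
def run_length_counts(u, cap=6):
    """Count runs up by length, lumping lengths >= cap together (Algorithm R)."""
    n = len(u)
    v = list(u) + [u[-1]]            # R1: sentinel U_n <- U_{n-1}
    count = [0] * (cap + 1)          # count[0] is unused
    j = -1
    while True:
        r = 0                        # R2
        while True:                  # R3
            r += 1
            j += 1
            if not v[j] < v[j + 1]:
                break
        count[min(r, cap)] += 1      # R4
        if j >= n - 1:               # R5
            break
    return count[1:]
```

On the sequence 1, 2, 9, 8, 5, 3, 6, 7, 0, 4 the runs up have lengths 3, 1, 1, 3, 2, so the result is two runs of length 1, one of length 2, and two of length 3.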


13. There are $(p + q + 1)\binom{p+q}{p}$ ways to have $U_{i-1} \gtrless U_i < \cdots < U_{i+p-1} \gtrless U_{i+p} < \cdots <
U_{i+p+q-1}$; subtract $\binom{p+q+1}{p+1}$ for those ways in which $U_{i-1} < U_i$, and subtract $\binom{p+q+1}{1}$
for those in which $U_{i+p-1} < U_{i+p}$; then add in 1 for the case that both $U_{i-1} < U_i$ and
$U_{i+p-1} < U_{i+p}$, since this case has been subtracted out twice. (This is a special case
of the inclusion-exclusion principle, which is explained further in Section 1.3.3.)




14. A run of length $r$ occurs with probability $1/r! - 1/(r + 1)!$, assuming distinct $U$'s.
Therefore we use $p_r = 1/r! - 1/(r + 1)!$ for $r < t$ and $p_t = 1/t!$ for runs of length $\ge t$.


15. This is always true of F(X) when F is continuous and X has distribution F; see 
the remarks following Eq. 3.3.1—(23). 


16. (a) $Z_{jt} = \max(Z_{j(t-1)}, Z_{(j+1)(t-1)})$. If the $Z_{j(t-1)}$ are stored in memory, it is
therefore a simple matter to transform this array into the set of $Z_{jt}$ with no auxiliary
storage required. (b) With his “improvement,” each of the V’s should indeed have the
stated distribution, but the observations are no longer independent. In fact, when $U_j$
is a relatively large value, all of $Z_{jt}, Z_{(j-1)t}, \ldots, Z_{(j-t+1)t}$ will be equal to $U_j$; so we
almost have the effect of repeating the same data t times (and that would multiply V
by t, as in exercise 3.3.1-10).
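The in-place update mentioned in part (a) might look like this in Python. It is a sketch, assuming that the maximum over positions j through j+t−1 is the quantity wanted and that 1 ≤ t ≤ len(u); the names are illustrative.

```python
def window_maxima_in_place(z):
    """Replace z[j] (a maximum over t-1 values) by max(z[j], z[j+1]).

    The list shrinks by one entry because the last window has no right neighbour.
    """
    for j in range(len(z) - 1):
        z[j] = max(z[j], z[j + 1])   # z[j+1] still holds its old value here
    z.pop()
    return z

def maxima_of_t(u, t):
    """Maxima of all windows of t consecutive values, built up one level at a time."""
    z = list(u)                      # level 1: each window is a single value
    for _ in range(t - 1):
        window_maxima_in_place(z)
    return z
```

Because the loop runs left to right, each entry is overwritten only after its old value has been used, so no auxiliary array is needed.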


17. (b) By Binet’s identity, the difference is $\sum_{0\le k<j<n}(U_kV_j - U_jV_k)^2$, and this is
certainly nonnegative. (c) Therefore if $D^2 = N^2$, we must have $U_kV_j - U_jV_k = 0$, for
all pairs j, k. This means that the matrix

$$\begin{pmatrix} U_0 & U_1 & \ldots & U_{n-1}\\ V_0 & V_1 & \ldots & V_{n-1}\end{pmatrix}$$

has rank < 2, so its rows are linearly dependent. (A more elementary proof can be
given, using the fact that $U_0V_j - U_jV_0 = 0$ for $1 \le j < n$ implies the existence of
constants α, β such that $\alpha U_j + \beta V_j = 0$ for all j, provided that $U_0$ and $V_0$ are not both
zero; the latter case can be avoided by a suitable renumbering.)
18. (a) The numerator is $-(U_0 - U_1)^2$, the denominator is $(U_0 - U_1)^2$. (b) The numerator
in this case is $-(U_0^2 + U_1^2 + U_2^2 - U_0U_1 - U_1U_2 - U_2U_0)$; the denominator
is $2(U_0^2 + \cdots - U_2U_0)$. (c) The denominator always equals $\sum_{0\le j<k<n}(U_j - U_k)^2$, by
exercise 1.2.3-30 or 1.2.3-31.
19. The stated result holds, in fact, whenever the joint distribution of $U_0, \ldots, U_{n-1}$
is symmetrical (unchanged under permutations). Let $S_1 = U_0 + \cdots + U_{n-1}$, $S_2 =
U_0^2 + \cdots + U_{n-1}^2$, $X = U_0U_1 + \cdots + U_{n-2}U_{n-1} + U_{n-1}U_0$, and $D = nS_2 - S_1^2$. Also
let $E f(U_0,\ldots,U_{n-1})$ denote the expected value of $f(U_0,\ldots,U_{n-1})$ subject to the
condition $D \ne 0$. Since D is a symmetric function, we have $E f(U_0,\ldots,U_{n-1}) =
E f(U_{p(0)},\ldots,U_{p(n-1)})$ for all permutations p of {0, ..., n − 1}. Therefore $E S_2/D =
nE\,U_0^2/D$, $E S_1^2/D = n(n-1)E(U_0U_1/D) + nE\,U_0^2/D$, and $E X/D = nE(U_0U_1/D)$.
It follows that $1 = E(nS_2 - S_1^2)/D = -(n-1)E(nX - S_1^2)/D$. (Strictly speaking,
$E S_2/D$ and $E S_1^2/D$ might be infinite, so we should be careful to work only with linear
combinations of expected values that are known to exist.)
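The symmetry argument can be checked exactly on small cases: averaging the coefficient over all orderings of any fixed list of values (not all equal) should give −1/(n − 1). The sketch below does this with rational arithmetic; the function names and sample values are assumptions for illustration.

```python
from fractions import Fraction
from itertools import permutations

def serial_correlation(u):
    """C = (n*X - S1^2) / (n*S2 - S1^2), with cyclic products X = sum U_j U_{j+1}."""
    n = len(u)
    s1 = sum(u)
    s2 = sum(x * x for x in u)
    x = sum(u[j] * u[(j + 1) % n] for j in range(n))
    return Fraction(n * x - s1 * s1, n * s2 - s1 * s1)

def mean_over_orderings(values):
    """Average C over all n! orderings of a fixed list of values (not all equal)."""
    perms = list(permutations(values))
    return sum(serial_correlation(p) for p in perms) / len(perms)
```

Since D is unchanged by reordering, the average depends only on the average of X, which is what the derivation above exploits.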


20. Let $E_{1111}$, $E_{211}$, $E_{22}$, $E_{31}$, and $E_4$ denote the respective values $E(U_0U_1U_2U_3/D^2)$,
$E(U_0^2U_1U_2/D^2)$, $E(U_0^2U_1^2/D^2)$, $E(U_0^3U_1/D^2)$, $E(U_0^4/D^2)$. Then we have $E S_2^2/D^2 =
n(n-1)E_{22} + nE_4$, $E(S_2S_1^2/D^2) = n(n-1)(n-2)E_{211} + n(n-1)E_{22} + 2n(n-1)E_{31} + nE_4$,
$E S_1^4/D^2 = n(n-1)(n-2)(n-3)E_{1111} + 6n(n-1)(n-2)E_{211} + 3n(n-1)E_{22} +
4n(n-1)E_{31} + nE_4$, $E X^2/D^2 = n(n-3)E_{1111} + 2nE_{211} + nE_{22}$, $E(XS_1^2/D^2) =
n(n-2)(n-3)E_{1111} + 5n(n-2)E_{211} + 2nE_{22} + 2nE_{31}$, $E((U_0 - U_1)^4/D^2) = 6E_{22} -
8E_{31} + 2E_4$, and the first result follows.

Let ô = a((Inn)/n)'/?, M = a?/2 + 1/3, and m = [1/6]. If we divide the 
range of the distribution into m equiprobable parts, we can show that each part will 
contain between nô(1 — 5) and né(1 + ô) points, with probability > 1 — O(n~™), 
using the tail inequalities 1.2.10-(24) and (25). Hence, if the distribution is uniform, 
$D = \frac{1}{12}n^2(1 + O(\delta))$ with at least this probability. If D is not in that range, we have
$0 \le (U_0 - U_1)^4/D^2 \le 1$. Since $E((U_0 - U_1)^4) = \int_0^1\!\int_0^1 (x - y)^4\,dx\,dy = \frac{1}{15}$, we may
conclude that $E((U_0 - U_1)^4/D^2) = \frac{48}{5}n^{-4}(1 + O(\delta)) + O(n^{-M})$.

Note: Let N be the numerator of (23). When the variables all have the normal 
distribution, W. J. Dixon proved that the expected value of eW’N+7?)/™ ig 


(1 — 2z — 2w)!/2(1 — 22 + \/(1 — 22)? — 4w?) 7”? + O(w"). 


Differentiating with respect to w and integrating with respect to z, he found the
moments $E(N/D)^{2k-1} = (-\frac12)^{\overline{k}}/(\frac12 n - \frac12)^{\overline{k}}$, $E(N/D)^{2k} = (+\frac12)^{\overline{k}}/(\frac12 n + \frac12)^{\overline{k}}$, when n > 2k.
In particular, the variance in this case is exactly $1/(n+1) - 1/(n-1)^2$. [Annals of
Math. Stat. 15 (1944), 119-144.]


21. The successive values of $c_{r-1} = s - 1$ in step P2 are 2, 3, 7, 6, 4, 2, 2, 1, 0; hence
f = 886862.


22. $1024 = 6! + 2\cdot5! + 2\cdot4! + 2\cdot3! + 2\cdot2! + 0\cdot1!$, so we want the successive values
of s − 1 in step P2 to be 0, 0, 0, 1, 2, 2, 2, 2, 0; working backwards, the permutation
is (9, 6, 5, 2, 3, 4, 0, 1, 7, 8).
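Both answers can be replayed mechanically. The sketch below assumes the usual form of the permutation-indexing procedure referred to as step P2 (find the position s of the maximum of the first r elements, set f ← r·f + s − 1, swap that maximum to position r, and decrease r). That reading, and the function names, are assumptions rather than a transcription of the algorithm.

```python
def permutation_index(u):
    """Return (f, digits): digits are the successive values of s - 1 for r = t, ..., 2."""
    u = list(u)
    f = 0
    digits = []
    for r in range(len(u), 1, -1):
        s = max(range(r), key=lambda i: u[i])   # 0-based position, i.e. s - 1
        digits.append(s)
        f = r * f + s
        u[r - 1], u[s] = u[s], u[r - 1]         # move the maximum to the end
    return f, digits

def index_from_digits(digits):
    """Evaluate f from the recorded digits alone, by the same recurrence."""
    f = 0
    for r, c in zip(range(len(digits) + 1, 1, -1), digits):
        f = r * f + c
    return f
```

Here `index_from_digits([2, 3, 7, 6, 4, 2, 2, 1, 0])` reproduces the value in answer 21, and `permutation_index((9, 6, 5, 2, 3, 4, 0, 1, 7, 8))` reproduces the digit sequence and the value 1024 in answer 22.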


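23. Let $P'(x_1,\ldots,x_t) = \frac{1}{\lambda'}\sum_{n=0}^{\lambda'-1}[(Y_n,\ldots,Y_{n+t-1}) = (x_1,\ldots,x_t)]$. Then we have

$$Q(x_1,\ldots,x_t) = \sum_{(y_1,\ldots,y_t)} P'(y_1,\ldots,y_t)\,P\bigl((x_1 - y_1) \bmod d,\ \ldots,\ (x_t - y_t) \bmod d\bigr);$$

more compactly, $Q(x) = \sum_y P'(y)P(x - y)$. Hence, using the general inequality
$(EX)^2 \le EX^2$, we have $\sum_x (Q(x) - d^{-t})^2 = \sum_x \bigl(\sum_y P'(y)(P(x-y) - d^{-t})\bigr)^2 \le
\sum_x\sum_y P'(y)(P(x-y) - d^{-t})^2 = \sum_y P'(y)\sum_x (P(x-y) - d^{-t})^2 = \sum_x (P(x) - d^{-t})^2$.
[See G. Marsaglia, Comp. Sci. and Statistics: Symp. on the Interface 16 (1984), 5-6.
The result is of interest only when $d^t < 2\lambda$, since each P(x) is a multiple of $1/\lambda$.]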
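For t = 1 the inequality says that mixing one distribution on {0, ..., d − 1} with another by addition mod d cannot move it farther from uniform. A small exact check, with illustrative names and sample distributions that are our own assumptions:

```python
from fractions import Fraction

def squared_deviation(p):
    """Sum over x of (p(x) - 1/d)^2 for a distribution p on {0, ..., d-1}."""
    d = len(p)
    return sum((px - Fraction(1, d)) ** 2 for px in p)

def mix(p, p_prime):
    """Q(x) = sum_y P'(y) P((x - y) mod d), the distribution of (X + Y) mod d."""
    d = len(p)
    return [sum(p_prime[y] * p[(x - y) % d] for y in range(d)) for x in range(d)]
```

With `p = [1/10, 4/10, 2/10, 3/10]` and `p_prime = [3/6, 0, 1/6, 2/6]`, the mixed distribution `mix(p, p_prime)` has squared deviation no larger than that of `p`.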


24. Write $k{:}\alpha$ and $\alpha{:}k$ for the first k and last k elements of string α. Let $K(\alpha,\beta) =
[\alpha = \beta]/P(\alpha)$, and let $\bar C$ be the $d^t \times d^t$ matrix with entries $\bar c_{\alpha\beta} = K(\alpha,\beta) - K(t-1{:}\alpha,\
t-1{:}\beta)$. Let C be the covariance matrix of the random variables N(α) for |α| = t,
divided by n. These variables are subject to the constraint $\sum_a N(a\alpha) = \sum_a N(\alpha a)$
for each of $d^{t-1}$ strings α, and we also have $\sum_{|\alpha|=t} N(\alpha) = n$; but all other linear
constraints are derivable from these (see Theorem 2.3.4.2G). Therefore C has rank
$d^t - d^{t-1}$, and by exercise 3.3.1-25 we need only show that $C\bar CC = C$.

It is not difficult to verify that cag = P(aB) J i,) <4 Tk(@, 8), where Ty(a, 8) is a 
term corresponding to the overlap that might occur when we superimpose ĝ on a@ and 
slide it k positions to the right: 


_ [ K(t+k:a, B:t+k)-1, ifk <0; 
ON ee if k >0. 


For example, if d = 2, t = 5, a = 01101, and 8 = 10101, we have cag = P(0)*P(1)° x 
(P(01)~' + P(101)™* + P(1)™* — 9). Entry a8 of CCC is therefore P(a3) times 


S> E Pha) SP Tela, ya)(K (a, 8) — 1) TiC, 6). 


|y|=t-1 a,b=0 |k|<t |l|<t 


Given k and l, the product Ti.(a, ya)(K(a, b) —1)Ti(7b, 8) expands to eight terms, each 
of which usually sums to +1 when multiplied by P(yab) and summed over all yab. For 
example, the sum of P(yab)K(2: a, ya: 2)K(a,b)K(3: yb, 8:3), when a= a1... at, 
B = bi... bi, y=c1...ce-1, and t > 5, is the sum of P(c4...cz-2), which is 1. If t = 4,
the same sum would be K’(a1, 64), but it would cancel with the sum of P(yab)K(2: a, 
ya : 2)(—1)K(3: yb, B : 3). The net result is therefore 0 unless k < 0 < l; otherwise it 
turns out to be K(i: (a: i—k),i:(8:i+1)-—K(i-1:(a:i-k), i-1:(68:i+D), 
where i = min(t + k, t — l). The sum over k and l telescopes to cag. 

25. Empirical tests show, in fact, that when (22) is generalized to arbitrary t the ratios 
of corresponding elements of CI 1 and CICCI! are very nearly —t, when t > 5. For 
example, when t = 6 they all lie between —6.039 and —6.111; when t = 20 they all lie 
between —20.039 and —20.045. This phenomenon demands an explanation. 


26. (a) The vectors $(S_1,\ldots,S_n)$ are uniformly distributed points in the (n − 1)-dimensional
polyhedron defined by the inequalities $S_1 \ge 0, \ldots, S_n \ge 0$ in the hyperplane
$S_1 + \cdots + S_n = 1$. An easy induction proves that

$$\int_{s_1}^{\infty} dt_1 \int_{s_2}^{\infty} dt_2 \cdots \int_{s_{n-1}}^{\infty} dt_{n-1}\,[1 - t_1 - \cdots - t_{n-1} \ge s_n] = \frac{(1 - s_1 - s_2 - \cdots - s_n)_+^{\,n-1}}{(n-1)!}.$$

To get the probability, divide this integral by its value in the special case $s_1 = \cdots =
s_n = 0$. [Bruno de Finetti, Giornale Istituto Italiano degli Attuari 27 (1964), 151-173.]
(b) The probability that $S_{(1)} \ge s$ is the probability that $S_1 \ge s, \ldots, S_n \ge s$.

(c) The probability that $S_{(k)} \ge s$ is the probability that at most k − 1 of the
$S_j$ are < s; hence $1 - F_k(s) = G_0(s) + \cdots + G_{k-1}(s)$, where $G_j(s)$ is the probability
that exactly j spacings are < s. By symmetry, $G_j(s)$ is $\binom{n}{j}$ times the probability that
$S_1 < s, \ldots, S_j < s, S_{j+1} \ge s, \ldots, S_n \ge s$; and the latter is $\Pr(S_1 < s, \ldots, S_{j-1} < s,
S_j \ge 0, S_{j+1} \ge s, \ldots, S_n \ge s) - \Pr(S_1 < s, \ldots, S_{j-1} < s, S_j \ge s, \ldots, S_n \ge s)$. Repeated
application of (a) shows that $G_j(s) = \binom{n}{j}\sum_l \binom{j}{l}(-1)^{j-l}(1 - (n-l)s)_+^{n-1}$; hence

$$1 - F_k(s) = \sum_{l} \binom{n}{l}\binom{n-l-1}{k-l-1}(-1)^{k-l-1}\bigl(1 - (n-l)s\bigr)_+^{n-1}.$$

In particular, the largest spacing $S_{(n)}$ has distribution

$$F_n(s) = 1 - \sum_{0\le l<n}\binom{n}{l}(-1)^{n-l-1}\bigl(1 - (n-l)s\bigr)_+^{n-1} = \sum_{l}\binom{n}{l}(-1)^{l}(1 - ls)_+^{n-1}.$$

[Incidentally, the similar quantity $x_+^{n-1}F_n(x^{-1})/(n-1)!$ turns out to be the density
function for the sum $U_1 + \cdots + U_n$ of uniform deviates.]

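(d) From the formulas $E\,s^r = r\int_0^1 (1 - F(s))\,s^{r-1}\,ds$ and $\int_0^{1/k} s^r(1 - ks)^{n-1}\,ds =
k^{-r-1}/\bigl(n\binom{n+r}{r}\bigr)$, we find $E\,S_{(k)} = n^{-1}(H_n - H_{n-k})$ and, with a bit of algebra, $E\,S_{(k)}^2 =
n^{-1}(n+1)^{-1}\bigl(H_n^{(2)} - H_{n-k}^{(2)} + (H_n - H_{n-k})^2\bigr)$. Thus the variance of $S_{(k)}$ is equal to
$n^{-1}(n+1)^{-1}\bigl(H_n^{(2)} - H_{n-k}^{(2)} - (H_n - H_{n-k})^2/n\bigr)$.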

[The distributions $F_k(s)$ were first found by W. A. Whitworth, in problem 667 of
DCC Exercises in Choice and Chance (Cambridge, 1897). Whitworth also discovered
an elegant way to compute the expected value of any polynomial in the functions
$G_k(s) = F_k(s) - F_{k+1}(s)$; this was published in a booklet entitled The Expectation of
Parts (Cambridge, 1898), and incorporated into the fifth edition of Choice and Chance
(1901). Simplified expressions for the mean and variance and for a variety of more
general spacing statistics were found by Barton and David, J. Royal Stat. Soc. B18
(1956), 79-94. See R. Pyke, J. Royal Stat. Soc. B27 (1965), 395-449, for a survey of
the ways in which statisticians have traditionally analyzed spacings as clues to potential
biases in data.]
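The closed forms in parts (c) and (d) are easy to evaluate exactly. The sketch below assumes n ≥ 2 and 1 ≤ k ≤ n; the function names are ours.

```python
from fractions import Fraction
from math import comb

def spacing_tail(k, n, s):
    """Pr(k-th smallest of n spacings >= s), from the alternating sum in part (c)."""
    s = Fraction(s)
    total = Fraction(0)
    for l in range(k):                                   # terms with k - l - 1 >= 0
        base = max(Fraction(0), 1 - (n - l) * s)         # the (...)_+ truncation
        total += (comb(n, l) * comb(n - l - 1, k - l - 1)
                  * (-1) ** (k - l - 1) * base ** (n - 1))
    return total

def mean_kth_spacing(k, n):
    """Expected k-th smallest spacing, (H_n - H_{n-k}) / n."""
    return sum(Fraction(1, j) for j in range(n - k + 1, n + 1)) / n
```

For n = 2 the two spacings are x and 1 − x, so the smaller one is at least 1/4 with probability 1/2 and the larger one is at least 3/4 with probability 1/2; both values come out of `spacing_tail`. The expected ordered spacings must also add up to 1.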


27. Consider the polyhedron in the hyperplane $S_1 + \cdots + S_n = 1$ defined by the
inequalities $S_1 \ge 0, \ldots, S_n \ge 0$. This polyhedron consists of n! congruent subpolyhedra
defined by the ordering of the S’s (assuming that the S’s are distinct), and the operation
of sorting is an n!-to-1 folding of the large polyhedron to the subpolyhedron in which
$S_1 \le \cdots \le S_n$. The transformation that takes $(S_{(1)},\ldots,S_{(n)})$ to $(S_1',\ldots,S_n')$ is
a 1-to-1 mapping that expands differential volumes by the factor n!. It takes the
vertices $(\frac1n,\ldots,\frac1n)$, $(0,\frac1{n-1},\ldots,\frac1{n-1})$, ..., $(0,\ldots,0,1)$ of the subpolyhedron into the
respective vertices (1,0,...,0), (0,1,0,...,0), ..., (0,...,0,1), linearly stretching and
distorting the overall shape in the process. (The Euclidean distance between vertices
$(0,\ldots,0,\frac1j,\ldots,\frac1j)$ and $(0,\ldots,0,\frac1k,\ldots,\frac1k)$ in the subpolyhedron is $|j^{-1} - k^{-1}|^{1/2}$; the
transformation produces a regular simplex in which all n vertices are $\sqrt2$ apart.)


The behavior of iterated spacings is easiest to understand if we examine the details
graphically when n = 3. In this case the polyhedron is simply an equilateral triangle,
whose points are represented with barycentric coordinates (x, y, z), x + y + z = 1. The
accompanying diagram illustrates the first two levels of a recursive decomposition of
this triangle. Each of the $6^2$ subtriangles has been labeled with a two-digit code pq,
where p represents the applicable permutation when $(x,y,z) = (S_1,S_2,S_3)$ is sorted into
$(S_{(1)},S_{(2)},S_{(3)})$, and q represents the permutation in the next stage when $S_1'$,
$S_2'$, and $S_3'$ are sorted, according to the following code:

0: x<y<z,  1: x<z<y,  2: y<x<z,  3: y<z<x,  4: z<x<y,  5: z<y<x.


For example, the points of subtriangle 34 have $S_2 < S_3 < S_1$ and $S_3' < S_1' < S_2'$.
We can continue this process to infinitely many levels; all points of the triangle with
irrational barycentric coordinates thereby acquire a unique representation as an infinite
radix-6 expansion. A tetrahedron can be subdivided similarly into 24, $24^2$, $24^3$, ...
subtetrahedra, and in general this procedure constructs a radix-n! expansion for the
points of any (n − 1)-dimensional simplex.

When n = 2 the process is especially simple: If $x \notin \{0, \frac12, 1\}$, the transformation
takes spacings (x, 1 − x) = (x, y) into either (2x mod 1, 2y mod 1) or (2y mod 1,
2x mod 1), depending on whether x < y or x > y. Repeated tests therefore essentially
shift the binary representation left one bit, possibly complementing the result. After at
most e + 1 iterations on e-bit numbers the process must converge to the fixed point (0, 1).
Permutation coding in the case n = 2 corresponds simply to folding and stretching a
line; the first four levels of subdivision have the following four-bit codes, running from
(0, 1) at the left to (1, 0) at the right:

0000 0001 0011 0010 0110 0111 0101 0100 1100 1101 1111 1110 1010 1011 1001 1000

This sequence is exactly the Gray binary code studied in Section 7.2.1. In general, the
radix-n! permutation code for an n-simplex has the property that adjacent regions have
identical codes except in one digit position. Each iteration of the spacing transformation
shifts off the leftmost digit of the representation of each point. Note that equal birthday
spacings are points near the boundary of the first-level decomposition.

This fundamental transformation from $(S_1,\ldots,S_n)$ to $(S_1',\ldots,S_n')$ is implicit in
Whitworth’s proof of Proposition LVI in the fifth edition of Choice and Chance (see
the reference in answer 26). It was first studied explicitly by J. Durbin [Biometrika
48 (1961), 41-55], who was inspired by a similar construction of P. V. Sukhatme
[Annals of Eugenics 8 (1937), 52-56]. The permutation coding for iterated spacings
was introduced by H. E. Daniels [Biometrika 49 (1962), 139-149].
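The n = 2 behavior is easy to reproduce. The sketch below assumes the transformation is the usual one for iterated spacings, namely sort the spacings and replace the k-th smallest by (n − k + 1) times its excess over the (k − 1)-st. That definition is taken as an assumption here (it is consistent with the vertex mapping and the doubling rule described above), and the function names are illustrative.

```python
from fractions import Fraction

def spacing_transform(spacings):
    """Sort the spacings, then scale each successive gap so the total stays 1."""
    s = sorted(spacings)
    n = len(s)
    out = []
    prev = Fraction(0)
    for k, value in enumerate(s):                # k = 0 .. n-1
        out.append((n - k) * (value - prev))
        prev = value
    return tuple(out)

def gray_codes(bits):
    """The reflected Gray binary code on the given number of bits."""
    return [format(i ^ (i >> 1), "0{}b".format(bits)) for i in range(1 << bits)]
```

Starting from the 4-bit pair (3/16, 13/16), five applications of `spacing_transform` reach the fixed point (0, 1), in line with the bound of e + 1 iterations, and `gray_codes(4)` reproduces the sixteen codes listed above.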


28. (a) The number of partitions of m into n distinct positive parts is $p_n\bigl(m - \binom{n+1}{2}\bigr)$,
by exercise 5.1.1-16. These partitions can be permuted in n! ways to yield n-tuples
$(y_1,\ldots,y_n)$ with $0 = y_1 < y_2 < \cdots < y_n < m$; and each of these n-tuples leads to
(n − 1)! n-tuples that have $y_1 = 0$ and $0 < y_2,\ldots,y_n < m$. Now add a constant mod m
to each $y_j$; this preserves the spacings. Hence $b_{n00}(m) = m\,n!\,(n-1)!\,p_n\bigl(m - \binom{n+1}{2}\bigr)$.

(b) Zero spacings correspond to balls in the same urn, and they contribute s − 1
to the count of equal spacings. Therefore $b_{nrs}(m) = \left\{{n\atop n-s}\right\} b_{(n-s)(r+1-s)0}(m)$.

(c) Since $\left\{{n\atop n-1}\right\} = \binom{n}{2}$, the probability is


monn" (mm (43) bran (8) 
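The count in part (a) can be confirmed by brute force for small n and m. The sketch below assumes the spacings are taken cyclically (the last spacing wraps around from the largest value back to the smallest, plus m) and that p_n(m) counts partitions of m into at most n parts; the function names are ours.

```python
from itertools import product
from math import comb, factorial

def partitions_at_most(n, m):
    """Partitions of m into at most n parts (0 if m < 0), counted via parts of size <= n."""
    if m < 0:
        return 0
    ways = [1] + [0] * m
    for part in range(1, n + 1):
        for total in range(part, m + 1):
            ways[total] += ways[total - part]
    return ways[m]

def count_distinct_spacings(n, m):
    """Brute force: n-tuples in [0, m)^n whose cyclic spacings are positive and all different."""
    count = 0
    for ys in product(range(m), repeat=n):
        s = sorted(ys)
        gaps = [s[i + 1] - s[i] for i in range(n - 1)] + [s[0] + m - s[-1]]
        if min(gaps) > 0 and len(set(gaps)) == n:
            count += 1
    return count

def formula(n, m):
    """m * n! * (n-1)! * p_n(m - binom(n+1, 2))."""
    return m * factorial(n) * factorial(n - 1) * partitions_at_most(n, m - comb(n + 1, 2))
```

For example, with n = 3 and m = 7 the only admissible spacings are 1, 2, 4 in some cyclic order, and both functions give 84.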


29. By the previous answer and exercise 5.1.1—-15 we have bno(z) = n! (n — 1)! zC) 
(1— z)... (1— 2z”). When r = 1, the n! in our previous derivation becomes n!/2, and 
the number of solutions to 0 < s1 < -++ < Sk < Sk41 < `: < Sn With sı +---+5, =m 
is the number of solutions to 0 < sı —1 < --- < sk — k < sky —K< +++ <sn-n++l 
with (sı — 1) +--+ (sk — k) + (Seq1 — k) + + (sn —n +1) =m-— (5) — k. Hence 
bni(z) = $n! (n—1)! Dg (z* - 2”) PEIG —z)...(1— z”). A similar argument shows 
that 


bn2(z 1 a oe E ee ae 
fat = (ss > (=e = TA 31 5 (2"—z2")(2"—z )) 


1<j<k<n “1<k<n 


We can obtain bnr(z) for general r from the formula 


op onr (zw S (z — biz”)... (2°71 — bnz") ( w Te e 


n! (n — 1)!z” ow T C1 ...Cn—1 (1 — z)... (1 — 2”) zi 


gn-l 


where cy = 1 + bk + bkbk—1 +--+ + by... b2b1 = 14 bkck—1ı. (The special case w = 1 is 
interesting because the left side sums to (1 — z)~"/n! in that case.) 


30. This is a good problem for the saddle point method [N. G. de Bruijn, Asymptotic
Methods in Analysis (North-Holland, 1961), Chapter 5]. We have $p_n(m) =
\frac{1}{2\pi i}\oint e^{f(z)}\,\frac{dz}{z}$, where $f(z) = -m\ln z - \sum_{k=1}^{n}\ln(1 - z^k)$. Let $\rho = n/m$ and $\delta = \sqrt{n}/m$;
integrating on the path $z = e^{-\rho+it\delta}$ gives $p_n(m) = \frac{\delta}{2\pi}\int_{-\pi/\delta}^{\pi/\delta} \exp\bigl(f(e^{-\rho+it\delta})\bigr)\,dt$. It is
convenient to use the identity


$$g(se^t) = \sum_{j=0}^{n}\frac{t^j}{j!}\,\vartheta^j g(s) + \int_0^t \frac{u^n}{n!}\,\vartheta^{n+1} g(se^{t-u})\,du,$$

where g = g(z) is any analytic function and $\vartheta$ is the operator $z\frac{d}{dz}$. When the function
$\vartheta^j g$ is evaluated at $e^z$ the result is the same as when $g(e^z)$ is differentiated j times with
respect to z. This principle leads to the formula


ae jln pis WB ay 
PH = mjs +E + AY Oe, 


j 
k=1 1>j 


because of another handy identity, 


l=e * Banz” 
l = ; 
n( z ) ps n-n! 


n>1 


Therefore we obtain an asymptotic expansion of the integrand, 


a fOr j = = e` . ‘i 
exp f(e p+ #2) = (I 1E vif(e ») aye t?/2+f (e7?) explicit — cot? — icat? +- = 


j20 


where cy = (ner) By + mint /2) nt) B2p)ô 4 O(n), etc.; and it turns out that 
cj = O(n?) for j > 3. Factoring out the constant term 


ô fle”) ô = Bı Lil 
= —k 
Qn” 2r n! pre-™P R 3 2 Ee? 


k=11>1 


© Tn | 10368n? 


E Jam tenta/4 (1 18a— a? 108a? — 36a? + af otn) 


Qn nln” 


leaves us with an integral whose integrand is exponentially small when $|t| \ge n^{\epsilon}$. We can
ignore larger values of t, because partial fraction expansion shows that the integrand
is $O((m/n)^{n/2})$; none of the other roots of unity occurs more than n/2 times as a
pole of the denominator. Hence we are allowed to “trade tails” [CMath, §9.4] and
integrate over all t. The formulas $\int_{-\infty}^{\infty} e^{-t^2/2}\,t^j\,dt = (j-1)(j-3)\ldots(1)\sqrt{2\pi}\,[j \text{ even}]$
and $n! = (n/e)^n\sqrt{2\pi n}\,\exp\bigl(\frac{1}{12}n^{-1} + O(n^{-3})\bigr)$ suffice to complete the evaluation.

With gn(m) = pn(m— ("2*)) in place of pn(m) the calculation proceeds in the 
same way but with cı increased by a(n” 2 — n™!/2) and with the additional factor 
exp(—p("3")). We get 


m” leet 13a? 169a* — 20162? — 1728a? + 414724 j O(n™?) 
n! (n— 1)! 288n ` 16588872 


qnm) = 


this matches the formula for $p_n(m)$ except that α has been changed to −α. (In fact,
if we define $p_n(m) = r_n\bigl(2m + \binom{n+1}{2}\bigr)$ and $q_n(m) = r_n\bigl(2m - \binom{n+1}{2}\bigr)$, the generating
function $R_n(z) = \sum_m r_n(m)z^m = \prod_{k=1}^{n}(z^{-k} - z^{k})^{-1}$ satisfies $R_n(1/z) = (-1)^n R_n(z)$.
This implies a duality formula $r_n(-m) = (-1)^{n-1}r_n(m)$, in the sense that this equation
is identically true when we express $r_n(m)$ as a polynomial in m and roots of unity.
Therefore we may say that $q_n(m) = (-1)^{n-1}p_n(-m)$. A general treatment of such duality
can be found in G. Pólya, Math. Zeitschrift 29 (1928), 549-640, §44.) For further
information see G. Szekeres, Quarterly J. Math. Oxford 2 (1951), 85-108; 4 (1953),
96-111.

The exact value of $q_n(m)$ when $m = 2^{25}$ and n = 512 is 7.08069 34695 90264
094... × $10^{1514}$; our approximation gives the estimate 7.080693501 × $10^{1514}$.

The probability that the birthday test finds R = 0 spacings is bnoo(m)/m” = 
n!(n—1)!m!~"qn(m) = e7™4 + O(n™'), by exercise 28, because the contribution 
from bnoi(m) is ~ aea = O(n"). Inserting the factor gn(z) = rai (@* — 1) 
into the integrand for qn(m) has e effect of ane ane the result ae + O(n"), 
because gn(e~?*”?) = (2)p + O(n? p) + itO(n7d) — $t?O(n76) + _ Similarly the 
extra factor igea —1)(z A — 1) essentially multiplies oe np? = Za”, plus 


$O(n^{-1})$; other contributions to the probability that R = 2 are $O(n^{-1})$. In this way
we find that the probability of r equal spacings is $e^{-\alpha/4}(\alpha/4)^r/r! + O(n^{-1})$, a Poisson
distribution; more complicated terms arise if we carry the expansion out to $O(n^{-2})$.


31. The 79 bits consist of 24 sets of three, $\{Y_n, Y_{n+31}, Y_{n+55}\}$, $\{Y_{n+1}, Y_{n+32}, Y_{n+56}\}$,
..., $\{Y_{n+23}, Y_{n+54}, Y_{n+78}\}$, plus 7 additional bits $Y_{n+24}, \ldots, Y_{n+30}$. The latter bits are
equally likely to be 0 or 1, but in each group of three the probability is $\frac14$ that the bits
will be {0,0,0} and $\frac34$ that they will be {0,1,1}. Therefore the probability generating
function for the sum of bits is $f(z) = \bigl(\frac{1+z}{2}\bigr)^7\bigl(\frac{1+3z^2}{4}\bigr)^{24}$, a polynomial of degree 55.
(Well, not quite; strictly speaking, it is $(2^{55}f(z) - 1)/(2^{55} - 1)$, because the all-0 case
is excluded.) The coefficients of $2^{55}f(z)$ are easily computed by machine, and we find
that the probability of more 1s than 0s is $18509401282464000/(2^{55} - 1) \approx 0.51374$.
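The machine computation mentioned here takes only a few lines. The sketch below expands the integer polynomial $(1+z)^7(1+3z^2)^{24}$ and adds up the coefficients for 40 or more ones out of 79; the helper names are illustrative.

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def poly_pow(p, e):
    """Raise a polynomial to a nonnegative integer power by repeated multiplication."""
    result = [1]
    for _ in range(e):
        result = poly_mul(result, p)
    return result

# Coefficients of 2^55 f(z) = (1 + z)^7 (1 + 3 z^2)^24.
coeffs = poly_mul(poly_pow([1, 1], 7), poly_pow([1, 0, 3], 24))
majority = sum(coeffs[40:])              # at least 40 ones among the 79 bits
probability = majority / (2 ** 55 - 1)   # the all-zero start is excluded
```

The numerator `majority` comes out to 18509401282464000, and `probability` rounds to 0.51374.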

Notes: This exercise is based on the discovery by Vattulainen, Ala-Nissila, and
Kankaala [Physical Review Letters 73 (1994), 2513-2516] that a lagged Fibonacci
generator fails a more complicated two-dimensional random walk test. Notice that the
sequence $Y_{2n}, Y_{2n+2}, \ldots$ will fail the test too, because it satisfies the same recurrence.
The bias toward 1s also carries over into the subsequence consisting of the even-valued
elements generated by $X_n = (X_{n-55} + X_{n-24}) \bmod 2^e$; we tend to have more
occurrences of $(\ldots10)_2$ than $(\ldots00)_2$ in binary notation.

There’s nothing magic about the number 79 in this test; experiments show that a
significant bias towards a majority of 1s is present also in random walks of length 101 or
1001 or 10001. But a formal proof seems to be difficult. After 86 steps the generating
function is $\bigl(\frac{1+3z^2}{4}\bigr)^{17}\bigl(\frac{1+2z^2+4z^3+z^4}{8}\bigr)^{7}$; then we get the factors $(1 + 2z^2 + 5z^3 + 5z^4 +
10z^5 + 8z^6 + z^7)/32$; then $(1 + 2z^2 + 7z^3 + 7z^4 + 15z^5 + 25z^6 + 29z^7 + 28z^8 + 13z^9 + z^{10})/128$,
etc. The analysis becomes more and more complicated as the walks get longer.

Intuitively, the preponderance of 1s that arise in the first 79 steps ought to persist
as long as the subsequent numbers are reasonably balanced between 0 and 1. The
accompanying diagram shows the results of a much smaller case, the generator $Y_n =
(Y_{n-2} + Y_{n-11}) \bmod 2$, which is easy to analyze exhaustively. In this case random walks
of length 445 have a 64% chance of finishing to the right of the starting point; this bias
disappears only when the length of the walk increases to half the period length (after
which, of course, 0s are more likely, although the full period does lack one 0).


[Figure: The probability that 1s outnumber 0s in random m-tuples when $Y_n = Y_{n-2} \oplus Y_{n-11}$, plotted for m from 0 to 2047.]

Lüscher’s discarding technique can be used to avoid the bias toward 1s (see the
end of Section 3.2.2). For example, with lags 55 and 24, no deviation from randomness is
observed for random walks of length 1001 when the numbers are generated in batches
of 165, if only the first 55 numbers of each batch are used.
32. Not if, say, X and Y each take the values (−n, m) with the respective probabilities
(m/(m+n), n/(m+n)), where $m < n < (1 + \sqrt2)m$. [Suppose two competitors differ
by X after playing one round of golf. Then they are of equal strength based on their
mean scores, but one might be more likely to win a one-round tournament while the
other will more often win in two rounds. See T. M. Cover, Amer. Statistician 43 (1989),
277-278, for a discussion of similar phenomena.]
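A concrete instance makes the counterexample tangible. The sketch below takes m = 1 and n = 2 (which satisfies $m < n < (1+\sqrt2)m$); the function name is an assumption.

```python
from fractions import Fraction

def positive_probabilities(m, n):
    """X and Y independent; each is -n with probability m/(m+n), +m with probability n/(m+n).

    Returns (mean of X, Pr(X > 0), Pr(X + Y > 0)).
    """
    dist = {-n: Fraction(m, m + n), m: Fraction(n, m + n)}
    mean = sum(v * p for v, p in dist.items())
    p_one = sum(p for v, p in dist.items() if v > 0)
    p_two = sum(px * py for x, px in dist.items()
                for y, py in dist.items() if x + y > 0)
    return mean, p_one, p_two
```

With m = 1 and n = 2 the mean is 0, a single value is positive with probability 2/3, yet the sum of two independent values is positive with probability only 4/9.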


33. We essentially want [z(*+!1=0/2] a eae z). Let m = k — 2l ae 
n = l; the desired coefficient is 4, $ e° wie =) where g(z) = mIn(+42)+nIn( 432) 


(#+22=1) In z. It is convenient (and saddle-wise) to integrate along the path z = e" 


where e = 4/(m + 3n) and u = —1 + it for —oo < t < œ. We have g(e™) = 
eu/2+u?/2+c3eu? 4 see }+++, where cp = €70*g(1)/k! = O(1). Also 1/(1 — e“) = 
= + 3 — Boeu/2! —.---. gr bis out the integrand and using the facts that 


Lf ee eat ed oS. es et /2y2* du = (—1)*(2k — 1) (2k — 3)... (1) V27 
yields the asymptotic formula 4 + (27)~ 2n(m + 3n)? + O((m + 3n)~*/?). If 
m + 3n is even, the same asymptotic formula holds, provided that we give half of the 
coefficient of z™+3™/2 to the 1s and half to the 0s. (This coefficient is Ce + 
O((m—3n)~*/?).) 

34. The number of strings of length n that exclude a given two-letter substring or pair
of substrings is the coefficient of $z^n$ in an appropriate generating function, and it can
be written $c\,e^{-n\tau}m^n + O(1)$ where c and τ have series expansions in powers of ε = 1/m:


Case Excluded Generating function c T 

1 aa (1+z)/p(z) 1+e7—268 e+e- že 

2 ab 1/(1-mz+2°) Ipe H3 --- e- 3é+ 

3 aa,bb  (1+2z)/(plz)+2°) 142e? 4e e 2P 426% —8et+ --- 
4 aa, bc (1+2z)/(p(z) +27 +23) 1426723 +--. 27? +The 
5 abbe = (1+2)/(l—mz+2z2?—23) 142e?-2e 4+... 27° tbtt 
6 abcd = 1/(1—mz+2z?) 1+2 412644 +... —2e?-6e4*+--- 


(Here a, b, c, d denote distinct letters and $p(z) = 1 - (m-1)(z + z^2)$. It turns out that
the effect of excluding {ab, ba} or {aa, ab} is equivalent to excluding {aa, bb}; excluding
{ab, ac} is equivalent to excluding {ab, cd}.) Let $S_n^{(j)}$ be the coefficient of $z^n$ in Case j
and let X be the total number of two-letter combinations that do not appear. Then
$E X = (mS_n^{(1)} + m^{\underline2}S_n^{(2)})/m^n$ and

$$E X^2 = \bigl(mS_n^{(1)} + m^{\underline2}(S_n^{(2)} + 6S_n^{(3)}) + 2m^{\underline3}(S_n^{(4)} + S_n^{(5)} + S_n^{(6)}) + m^{\underline4}S_n^{(6)}\bigr)/m^n.$$
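Some of the generating functions can be compared with direct enumeration. The sketch below expands four of them (excluding aa; ab; aa and bb; ab and cd) and counts strings by brute force. The letter encoding 0, 1, 2, 3 for a, b, c, d and all function names are illustrative assumptions, and only these four cases are covered.

```python
from itertools import product

def series(num, den, terms):
    """First `terms` coefficients of num(z)/den(z); den[0] must be 1."""
    out = []
    for n in range(terms):
        c = num[n] if n < len(num) else 0
        c -= sum(den[k] * out[n - k] for k in range(1, min(n, len(den) - 1) + 1))
        out.append(c)
    return out

def brute(m, n, forbidden):
    """Count length-n strings over m letters that contain none of the forbidden pairs."""
    return sum(
        all((w[i], w[i + 1]) not in forbidden for i in range(n - 1))
        for w in product(range(m), repeat=n)
    )

def generating_functions(m):
    """(numerator, denominator, forbidden pairs) for four of the tabulated cases."""
    p = [1, -(m - 1), -(m - 1)]                        # p(z) = 1 - (m-1)(z + z^2)
    return {
        "aa":     ([1, 1], p, {(0, 0)}),
        "ab":     ([1], [1, -m, 1], {(0, 1)}),
        "aa, bb": ([1, 1], [1, -(m - 1), -(m - 2)], {(0, 0), (1, 1)}),  # p(z) + z^2
        "ab, cd": ([1], [1, -m, 2], {(0, 1), (2, 3)}),
    }
```

For m = 4 the series coefficients agree with the brute-force counts for every length from 0 through 5.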


35. (a) E Sm = NIDN Do Zari HN Os ce Zn+; = m/N, because 
yo 232 Se i 1)=1. 

(b) Let €* = ai?! + --- + ax, and define the linear function f as in the 
first solution to exercise 3.2.2-16. Then Y, = f(€"), and it follows that Yn+i + 
Yng = FE") + FEH) = f(E + Et) = f(a) (modulo 2), where a is 


aa uaring Hence E S?, = N DD p DODN d Znti Zaj = 


=Po : ee: : Zvi a 2 Sen Daa Zn) =m — m(m = 1)/N. 
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(c) Et a Zati = o E Zn4j = 0 and BO o aude) =}, ; E Zati 
X o<i<jem (E Zn+:)(E Zn+;) = m when each Zn is truly random. Thus the mean m 


variance of Sm are very close to the correct values when m < N. 
(d) E S$, = NIETS oDe m inn Zn4hZn+iZn+;. If any of h, i, or j 
are equal, the sum on n is 1; hence 


N-1 
1 
E Sh =W (r —m2+6 5 5 ZasnZntiZnss ): 


O<h<i<j<m n=0 


Arguing as in (b), we ind that the sum on n will be 1 if €"+é'+& Æ 0; otherwise it will 
be —N. Thus E Sh = më —6B(N+1)/N, where B = Docpeicjeml6’ + EË +E = 0] = 

D gaij lA E + E&I =0] (m — j). Finally observe that 1 + ¿t = & in the field if and 
only if f(€'*') = f(€7*") for 0 < l< k, assuming that 0<i< j <N. 

(e) The only nonzero term occurs for i = 31 and j = 55; hence B = 79 − 55 = 24.
(The next nonzero term occurs when i = 62 and j = 110.) In a truly random situation,
$E S_m^3$ should be zero, so this value $E S_{79}^3 \approx -144$ is distinctly nonrandom. Curiously it
is negative, although exercise 31 showed that $S_{79}$ is usually positive. The value of $S_{79}$
tends to be more seriously negative when it does dip below zero.

Reference: IEEE Trans. IT-14 (1968), 569-576. Experiments by M. Matsumoto 
and Y. Kurita [ACM Trans. Modeling and Comp. Simul. 2 (1992), 179-194; 4 (1994), 
254-266] confirm that trinomial-based generators fail such distribution tests even when 
the lags are quite large. See also ACM Trans. Modeling and Comp. Simul. 6 (1996), 
99-106, where they exhibit exponentially long subsequences of low density. 


SECTION 3.3.3 
1. $y((x/y)) + \frac12 y - \frac12 y\,\delta(x/y)$.
2. $((x)) = -\sum_{n\ge1}\frac{1}{n\pi}\sin 2\pi nx$, which converges for all x. (The representation in
Eq. (24) may be considered a “finite” Fourier series, for the case when x is rational.)
3. The sum is $((2^n x)) - ((x))$. [See Trans. Amer. Math. Soc. 65 (1949), 401.]
4. $d_{\max} = 2^{10}\cdot 5$. Note that we have $X_{n+1} < X_n$ with probability $\frac12 + \epsilon$, where
$|\epsilon| < d/(2\cdot10^{10}) \le 1/(2\cdot5^9)$;
hence every potency-10 generator is respectable from the standpoint of Theorem P.


5. An intermediate result: 


6. (a) Use induction and the formula 


hj +e hj+e-1 i Ls hj+c Ls hjt+te-1 
k k k 2 k 2 k : 
h'j j ki k'j i lef ea 
he) ee eee ent (( k )) (( hk h )) (( h )) hk ` 5°( h ) 
7. Take m = h, n = k, k = 2 in the second formula of exercise 1.2.4—45: 


> -EE-E F-a) e-o. 


0O<j<k 
The sums on the left simplify, and by standard manipulations we get


2 
hk — hk Kal a 3 P a(h,k,0) — Ža(k, h,0) + L o(1,k,0) = h’k — hk. 


Since $\sigma(1,k,0) = (k-1)(k-2)/k$, this reduces to the reciprocity law.
8. See Duke Math. J. 21 (1954), 391-397. 
9. Begin with the interesting identity $\sum_{k=0}^{r-1}\lfloor kp/r\rfloor\lfloor kq/r\rfloor + \sum_{k=0}^{p-1}\lfloor kq/p\rfloor\lfloor kr/p\rfloor +
\sum_{k=0}^{q-1}\lfloor kr/q\rfloor\lfloor kp/q\rfloor = (p-1)(q-1)(r-1)$, for which a simple geometric proof is
possible, assuming that $p \perp q$, $q \perp r$, and $r \perp p$. [U. Dieter, Abh. Math. Sem. Univ.
Hamburg 21 (1957), 109-125.]
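The identity is easy to test numerically before attempting the geometric proof. In this sketch the summation ranges are taken as 0 ≤ k < r, 0 ≤ k < p, and 0 ≤ k < q respectively, which is our reading of the formula; the function name is illustrative.

```python
from math import gcd

def floor_triple_sum(p, q, r):
    """Left side of the identity: three sums of products of floors."""
    return (sum((k * p // r) * (k * q // r) for k in range(r))
            + sum((k * q // p) * (k * r // p) for k in range(p))
            + sum((k * r // q) * (k * p // q) for k in range(q)))
```

For (p, q, r) = (2, 3, 5) the three sums are 3, 2, and 3, which add up to 8 = 1·2·4.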
10. Obviously $\sigma(k-h, k, c) = -\sigma(h, k, -c)$, by (8). Replace j by k − j in definition
(16), to deduce that $\sigma(h,k,c) = \sigma(h,k,-c)$.


n o D (A) (CH) = E (EE) (HEY) 0 cos nt 


0<j<dk O<i<d 
0<j<k 


o (AE) = (MED) +f bE): 


12. Since $\bigl(\bigl(\frac{hj+c}{k}\bigr)\bigr)$ runs through the same values as $\bigl(\bigl(\frac{j}{k}\bigr)\bigr)$ in some order, Cauchy’s
inequality implies that $\sigma(h,k,c)^2 \le \sigma(1,k,0)^2$; and σ(1,k,0) may be summed directly,
see exercise 7.


13. o(h,k,c) 4 eL = z >, (or toT 7 (emod k) — (4 )): 


O0<j<k 


if hh’ = 1 (modulo k). 


14. $(2^{38} - 3\cdot2^{20} + 5)/(2^{70} - 1) \approx 2^{-32}$. An extremely satisfactory global value, in spite
of the local nonrandomness!


15. Replace $c^2$ where it appears in (19) by $\lfloor c\rfloor\lceil c\rceil$.


16. The hinted identity is equivalent to $m_1 = p_r m_{r+1} + p_{r-1}m_{r+2}$ for $1 \le r \le t$; this
follows by induction. (See also exercise 4.5.3-32.) Now replace $c_j$ by $\sum_{j\le r\le t} b_r m_{r+1}$
and compare coefficients of $b_ib_j$ on both sides of the identity to be proved.

Note: For all exponents e > 1, a similar argument gives 


à cE 1 a Ch — Chay 
yh ii CS in, 
-bike i 


Cj —C 
1<j<t 1<j<t aS 


17. During this algorithm we will have $k = m_j$, $h = m_{j+1}$, $c = c_j$, $p = p_{j-1}$, $p' = p_{j-2}$,
$s = (-1)^{j+1}$ for j = 1, 2, ..., t + 1.

D1. [Initialize.] Set A ← 0, B ← h, p ← 1, p′ ← 0, s ← 1.

D2. [Divide.] Set a ← ⌊k/h⌋, b ← ⌊c/h⌋, r ← c mod h. (Now $a = a_j$, $b = b_j$, and
$r = c_{j+1}$.)

D3. [Accumulate.] Set A ← A + (a − 6b)s, B ← B + 6bp(c + r)s. If r ≠ 0 or c = 0,
set A ← A − 3s. If h = 1, set B ← B + ps. (This subtracts $3e(m_{j+1}, c_j)$ and
also takes care of the $\sum(-1)^{j+1}/m_jm_{j+1}$ terms.)

D4. [Prepare for next iteration.] Set c ← r, s ← −s; set r ← k − ah, k ← h,
h ← r; set r ← ap + p′, p′ ← p, p ← r. If h > 0, return to D2. ∎

At the conclusion of this algorithm, p will be equal to the original value $k_0$ of k, so
the desired answer will be A + B/p. The final value of p′ will be h′ if s < 0, otherwise p′
will be $k_0 - h'$. It would be possible to maintain B in the range $0 \le B < k_0$, by making
appropriate adjustments to A, thereby requiring only single-precision operations (with
double-precision products and dividends) if $k_0$ is a single-precision number.
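Steps D1 through D4 translate almost line for line into Python. The sketch below also includes a direct evaluation of the sum for comparison, assuming the definition $\sigma(h,k,c) = 12\sum_{0\le j<k}((j/k))(((hj+c)/k))$ and integer arguments with 0 < h ≤ k, 0 ≤ c < k, and h relatively prime to k. Function and variable names are ours.

```python
from fractions import Fraction

def sawtooth(x):
    """((x)) = x - floor(x) - 1/2, or 0 when x is an integer."""
    x = Fraction(x)
    if x.denominator == 1:
        return Fraction(0)
    return x - (x.numerator // x.denominator) - Fraction(1, 2)

def dedekind_direct(h, k, c):
    """Evaluate the generalized Dedekind sum straight from the assumed definition."""
    return 12 * sum(sawtooth(Fraction(j, k)) * sawtooth(Fraction(h * j + c, k))
                    for j in range(k))

def dedekind_algorithm_d(h, k, c):
    """Steps D1-D4, using only integer arithmetic until the final division."""
    A, B, p, p_prev, s = 0, h, 1, 0, 1           # D1
    while h > 0:
        a, b, r = k // h, c // h, c % h          # D2
        A += (a - 6 * b) * s                     # D3
        B += 6 * b * p * (c + r) * s
        if r != 0 or c == 0:
            A -= 3 * s
        if h == 1:
            B += p * s
        c, s = r, -s                             # D4
        k, h = h, k - a * h
        p, p_prev = a * p + p_prev, p
    return A + Fraction(B, p)
```

For instance, (h, k, c) = (3, 7, 5) gives 9/7 by either route, and (1, k, 0) gives (k − 1)(k − 2)/k.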


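18. A moment’s thought shows that the formula

$$S(h,k,c,z) = \sum_{0\le j<k}\bigl(\lfloor j/k\rfloor - \lfloor (j-z)/k\rfloor\bigr)\Bigl(\Bigl(\frac{hj+c}{k}\Bigr)\Bigr)$$

is in fact valid for all $z \ge 0$, not only when $k \ge z$. Writing $\lfloor j/k\rfloor - \lfloor (j-z)/k\rfloor =
\frac{z}{k} + \bigl(\bigl(\frac{j-z}{k}\bigr)\bigr) - \bigl(\bigl(\frac{j}{k}\bigr)\bigr) + \frac12\delta_{j0} - \frac12\delta\bigl(\frac{j-z}{k}\bigr)$ and carrying out the sums yields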


sto) HE) iranta tnt GE) (ME) 


where d = gcd(h, k). [This formula allows us to express the probability that $X_{n+1} <
X_n < \alpha$ in terms of generalized Dedekind sums, given α.]


19. The desired probability is 


mS aoe - ES) 


=0 


a Ge) ae 


«(25s (2)) (He) a) 


_ßB-aßf—-a' 1 


m m ' 12m 


(E) 
16(282=2) 


(ola, m,c+aa -— a’) — o(a,m,c+aa— b’) 
+o(a,m,c+aß — p’) - o(a,m,c+aß — a’) TFE 


where $|\epsilon| < 2.5/m$.
[This approach is due to U. Dieter. The discrepancy between the true probability
and the ideal value $\frac{\beta-\alpha}{m}\,\frac{\beta'-\alpha'}{m}$ is bounded by $\sum_{j} a_j/4m$, according to Theorem K;
conversely, by choosing α, β, α′, β′ appropriately we will obtain a discrepancy of at least
half this bound when there are large partial quotients, using the fact that Theorem K
is “best possible.” Note that when $a \approx \sqrt{m}$ the discrepancy cannot exceed $O(1/\sqrt{m})$,
so even the locally nonrandom generator of exercise 14 will look good on the serial test
over the full period; it appears that we should insist on an extremely small discrepancy.]


20. Yocn<m|(# — s(x))/m][(s(x) — s(s(z)))/m]/m = Vioceem((e — s(2))/m + 
(((bx-+c)/m)) + 4)((s(@) —s(s(x))) /m-+((a(ba-+e)/m)) +4) /m; and z/m = ((a/m))+ 
5 — 20(a/m), s(x)/m = (((ax + ©)/m)) + 3 — 56 ((ax + ¢)/m ), s(s(2))/m = (((a?a + 
ac +c)/m)) +4 — 46((a?x +ac+c)/m). Let s(x’) = s(s(x”)) = 0 and d= gcd(b, m). 
The sum now reduces to 


H (S1 — S2 + S3 — Sa + S5 — S6 + S7 — Sg + So) 4 ((5)) 


eo A E AA a(C) ry 
where $S_1 = \sigma(a,m,c)$, $S_2 = \sigma(a^2,m,ac+c)$, $S_3 = \sigma(ab,m,ac)$, $S_4 = \sigma(1,m,0) =
(m-1)(m-2)/m$, $S_5 = \sigma(a,m,c)$, $S_6 = \sigma(b,m,c)$, $S_7 = -\sigma(a'-1,m,a'c)$, and
$S_8 = -\sigma(a'(a'-1), m, (a')^2c)$, if $a'a \equiv 1$ (modulo m); and finally


ser D (EEN) 


O<r<m 
= PES, (= =) (( etl )) 
= wd D ; (E) ml sso) (28 a )) 


za (otaa, m, aco) 4 122 ((% )) (2 ))) 


where $c_0 = c \bmod d$. The grand total will be near $\frac16$ when d is small and when the
fractions a/m, $(a^2 \bmod m)/m$, $(ab \bmod m)/m$, b/m, $(a'-1)/m$, $(a'(a'-1) \bmod m)/m$,
$((ad) \bmod m)/m$ all have small partial quotients. (Note that $a' - 1 \equiv -b + b^2 - \cdots$,
as in exercise 3.2.1.3-7.)


21. Notice first that the main integral decomposes nicely: 


Intl _ 
s= f a{ax+6} dr = -G sth): foo” is 
z a 


3 2 2 


n 


1 0 2 
1 0 a-i 0 
= +O} dx = so bess +Sa—14t +0) dx = H t . 
s [ oe } dx =sot+s1 Sazi [,, ker ) dx aaa 


Therefore $C = \bigl(s - (\frac12)^2\bigr)/\bigl(\frac13 - (\frac12)^2\bigr) = (1 - 6\theta + 6\theta^2)/a$.
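The closing formula can be confirmed by integrating piece by piece with exact rationals. The sketch assumes that s denotes $\int_0^1 x\,\{ax+\theta\}\,dx$ for a positive integer a and rational 0 ≤ θ < 1; the function names are illustrative.

```python
from fractions import Fraction
from math import floor

def serial_moment(a, theta):
    """Exact value of the integral of x * frac(a*x + theta) over [0, 1)."""
    theta = Fraction(theta)
    # Breakpoints where a*x + theta crosses an integer.
    inner = [(k - theta) / a for k in range(1, a + 1) if 0 < (k - theta) / a < 1]
    cuts = [Fraction(0)] + inner + [Fraction(1)]
    total = Fraction(0)
    for lo, hi in zip(cuts, cuts[1:]):
        k = floor(a * (lo + hi) / 2 + theta)      # integer part on this piece
        # Integrate x*(a*x + theta - k) = a*x^2 + (theta - k)*x from lo to hi.
        total += Fraction(a, 3) * (hi ** 3 - lo ** 3) + (theta - k) / 2 * (hi ** 2 - lo ** 2)
    return total

def correlation(a, theta):
    """C = (s - 1/4) / (1/3 - 1/4)."""
    return (serial_moment(a, theta) - Fraction(1, 4)) / (Fraction(1, 3) - Fraction(1, 4))
```

With a = 5 and θ = 1/3 this gives C = −1/15, matching (1 − 6θ + 6θ²)/a.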
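22. We have s(x) < x in the disjoint intervals $\bigl[\frac{1-\theta}{a} \,..\, \frac{1-\theta}{a-1}\bigr)$, $\bigl[\frac{2-\theta}{a} \,..\, \frac{2-\theta}{a-1}\bigr)$, ..., $\bigl[\frac{a-\theta}{a} \,..\, 1\bigr)$,
which have total length

$$1 + \sum_{0<j\le a-1}\Bigl(\frac{j-\theta}{a-1}\Bigr) - \sum_{0<j\le a}\Bigl(\frac{j-\theta}{a}\Bigr) = 1 + \frac{a}{2} - \theta - \frac{a+1}{2} + \theta = \frac12.$$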


23. We have s(s(x)) < s(x) < x when a is in [%%.. 5&2) and ax + 6 — k is in 


2), for0<j<k<a; or when z is in [=£ ..1) and az + 6 — a is either in 


aa ..9). The desired probability is 


a 


a ae 


[#2 .. £%) for 0 < j < [að] or in 


a 


i a con ec 


0<jSk<a 0<j<|[ad| 
1, 1 @ _ 1 ((a6)([a0) +1-26) | 


which is $\frac16 + (1 - 3\theta + 3\theta^2)/6a + O(1/a^2)$ for large a. Note that $1 - 3\theta + 3\theta^2 \ge \frac14$, so θ
can’t be chosen to make this probability come out right.
24. Proceed as in the previous exercise; the sum of the interval lengths is

$$\sum_{0<j_1\le\cdots\le j_{t-1}<a}\frac{j_1}{a^{t-1}(a-1)} = \frac{1}{a^{t-1}(a-1)}\binom{a+t-2}{t}.$$

Fig. A-1. Permutation regions for the Fibonacci generator.

Fig. A-2. Run-length regions for the Fibonacci generator.


To compute the average length, let $p_k$ be the probability of a run of length $\ge k$; the
average is

$$\sum_{k\ge1}p_k = \sum_{k\ge1}\binom{a+k-2}{k}\frac{1}{a^{k-1}(a-1)} = \Bigl(\frac{a}{a-1}\Bigr)^{a} - \frac{a}{a-1}.$$

The value for a truly random sequence would be $e - 1$; and our value is $e - 1 + (e/2 - 1)/a + O(1/a^2)$. [Note: The same result holds for an ascending run, since we
have $U_n > U_{n+1}$ if and only if $1 - U_n < 1 - U_{n+1}$. This would lead us to suspect that
runs in linear congruential sequences might be slightly longer than normal, so the run
test should be applied to such generators.]
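The closed form for the average run length can be checked numerically. The following is a minimal sketch, assuming the series is $\sum_{k\ge1}\binom{a+k-2}{k}/(a^{k-1}(a-1))$ with $\theta = 0$; the function names are illustrative and do not come from the text.

```python
from math import comb, e

def average_run_series(a, terms=200):
    """Partial sum of sum_{k>=1} C(a+k-2, k) / (a^(k-1) (a-1))."""
    return sum(comb(a + k - 2, k) / (a ** (k - 1) * (a - 1))
               for k in range(1, terms + 1))

def average_run_closed(a):
    """Closed form (a/(a-1))^a - a/(a-1)."""
    return (a / (a - 1)) ** a - a / (a - 1)

def average_run_asymptotic(a):
    """First two terms of the expansion: e - 1 + (e/2 - 1)/a."""
    return e - 1 + (e / 2 - 1) / a

if __name__ == "__main__":
    for a in (10, 100, 1000):
        print(a, average_run_series(a), average_run_closed(a),
              average_run_asymptotic(a))
```

The terms of the series shrink roughly like $a^{-k}$, so a couple of hundred terms are far more than enough for multipliers of any realistic size.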
25. $x$ must be in the interval $[(k + \alpha' - \theta)/a\,..\,(k + \beta' - \theta)/a)$ for some $k$, and also in
the interval $[\alpha\,..\,\beta)$. Let $k_0 = \lceil a\alpha + \theta - \beta'\rceil$, $k_1 = \lceil a\beta + \theta - \beta'\rceil$. With due regard to
boundary conditions, we get the probability
$(k_1 - k_0)(\beta' - \alpha')/a + \max(0, \beta - (k_1 + \alpha' - \theta)/a) - \max(0, \alpha - (k_0 + \alpha' - \theta)/a)$.
This is $(\beta - \alpha)(\beta' - \alpha') + \epsilon$, where $|\epsilon| < 2(\beta' - \alpha')/a$.
26. See Fig. A-1. The orderings $U_1 < U_3 < U_2$ and $U_2 < U_3 < U_1$ are impossible; the
other four each have probability $\frac14$.
27. $U_n = \{F_{n-1}U_0 + F_nU_1\}$. We need to have both $F_{k-1}U_0 + F_kU_1 < 1$ and $F_kU_0 + F_{k+1}U_1 > 1$. The half-unit-square in which $U_0 > U_1$ is broken up as shown in Fig. A-2,
with various values of $k$ indicated. The probability for a run of length $k$ is $\frac12$ if $k = 1$;
it is $1/F_{k-1}F_{k+1} - 1/F_kF_{k+2}$, if $k > 1$. The corresponding probabilities for a random
sequence are $2k/(k+1)! - 2(k+1)/(k+2)!$; the following table compares the first few
values.


k:                               1      2      3       4        5
Probability in Fibonacci case:   1/2    1/6    7/30    7/120    41/1560
Probability in random case:      1/3    5/12   11/60   19/360   29/2520
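The two probability formulas of answer 27 are easy to tabulate exactly. The sketch below uses rational arithmetic and assumes the usual indexing $F_0 = 0$, $F_1 = 1$; the helper names are illustrative only.

```python
from fractions import Fraction
from math import factorial

def fib(n):
    """Fibonacci number F_n with F_0 = 0, F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def fibonacci_run_prob(k):
    """1/2 for k = 1, else 1/(F_{k-1} F_{k+1}) - 1/(F_k F_{k+2})."""
    if k == 1:
        return Fraction(1, 2)
    return (Fraction(1, fib(k - 1) * fib(k + 1))
            - Fraction(1, fib(k) * fib(k + 2)))

def random_run_prob(k):
    """2k/(k+1)! - 2(k+1)/(k+2)! for a truly random sequence."""
    return (Fraction(2 * k, factorial(k + 1))
            - Fraction(2 * (k + 1), factorial(k + 2)))

if __name__ == "__main__":
    for k in range(1, 6):
        print(k, fibonacci_run_prob(k), random_run_prob(k))
```

Both rows telescope, so the probabilities in each case sum to 1 over all $k \ge 1$.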


28. Fig. A-3 shows the various regions in the general case. The “213” region means
$U_2 < U_1 < U_3$, if $U_1$ and $U_2$ are chosen at random; the “321” region means that
$U_3 < U_2 < U_1$, etc. The probabilities for 123 and 321 are $\frac14 - \alpha/2 + \alpha^2/2$; the
probabilities for all other cases are $\frac18 + \alpha/4 - \alpha^2/4$. To have all equal to $\frac16$, we must have
$1 - 6\alpha + 6\alpha^2 = 0$. [This exercise establishes a theorem due to J. N. Franklin, Math.
Comp. 17 (1963), 28-59, Theorem 13; other results of Franklin’s paper are related to
exercises 22 and 23.]

Fig. A-3. Permutation regions for a generator with potency 2; $\alpha = (a - 1)c/m$.


SECTION 3.3.4 
1. For generators of maximum period, the 1-D accuracy $\nu_1$ is always $m$, and $\mu_1 = 2$.


2. Let $V$ be the matrix whose rows are $V_1, \ldots, V_t$. To minimize $Y\cdot Y$, subject to the
condition that $Y \ne (0,\ldots,0)$ and $VY$ is an integer column vector $X$, is equivalent
to minimizing $(V^{-1}X)\cdot(V^{-1}X)$, subject to the condition that $X$ is a nonzero integer
column vector. The columns of $V^{-1}$ are $U_1, \ldots, U_t$.


3. $a^2 \equiv 2a - 1$ and $a^3 \equiv 3a - 2$ (modulo $m$). By considering all short solutions of (15),
we find that $\nu_3^2 = 6$ and $\nu_4^2 = 4$, for the respective vectors $(1,-2,1)$ and $(1,-1,-1,1)$,
except in the following cases:

$m = 9$, $a = 4$ or $7$, $\nu_2^2 = \nu_3^2 = 5$;

$m = 9q$, $a = 3q+1$ or $6q+1$, $\nu_4^2 = 2$.
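These values can be confirmed by exhaustive search for small moduli. The sketch below is a brute-force check, not Algorithm S: it assumes that condition (15) reads $u_1 + au_2 + \cdots + a^{t-1}u_t \equiv 0$ (modulo $m$) and only examines vectors whose components are at most `bound` in absolute value, so its answer is reliable only when the minimum found is at most `bound` squared.

```python
from itertools import product

def nu_squared(a, m, t, bound=3):
    """Least u.u over nonzero integer vectors u, |u_i| <= bound, with
    u_1 + a*u_2 + ... + a^(t-1)*u_t == 0 (mod m); None if none found."""
    powers = [pow(a, j, m) for j in range(t)]
    best = None
    for u in product(range(-bound, bound + 1), repeat=t):
        if any(u) and sum(c * x for c, x in zip(powers, u)) % m == 0:
            norm = sum(x * x for x in u)
            if best is None or norm < best:
                best = norm
    return best

if __name__ == "__main__":
    # m = 256, a = 17 has potency 2 because (a - 1)^2 = 256.
    print(nu_squared(17, 256, 3), nu_squared(17, 256, 4))
    # The exceptional cases listed in the answer.
    print(nu_squared(4, 9, 2), nu_squared(4, 9, 3), nu_squared(7, 18, 4))
```

The example multiplier 17 with modulus 256 is chosen here only because it is a small generator of potency 2; it is not taken from the text.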
4. (a) The unique choice for $(x_1, x_2)$ is $\frac1m(y_1u_{22} - y_2u_{21}, -y_1u_{12} + y_2u_{11})$, and this
is $\equiv \frac1m(y_1u_{22} + y_2au_{22}, -y_1u_{12} - y_2au_{12}) \equiv (0,0)$ (modulo 1); that is, $x_1$ and $x_2$ are
integers. (b) When $(x_1,x_2) \ne (0,0)$, we have $(x_1u_{11} + x_2u_{21})^2 + (x_1u_{12} + x_2u_{22})^2 = x_1^2(u_{11}^2 + u_{12}^2) + x_2^2(u_{21}^2 + u_{22}^2) + 2x_1x_2(u_{11}u_{21} + u_{12}u_{22})$, and by hypothesis this is
$\ge (x_1^2 + x_2^2 - |x_1x_2|)(u_{11}^2 + u_{12}^2) \ge u_{11}^2 + u_{12}^2$.

[Note that this is a stronger result than Lemma A, which tells us only that
$x_1^2 \le (u_{11}^2 + u_{12}^2)(u_{21}^2 + u_{22}^2)/m^2$ and that $x_2^2 \le (u_{11}^2 + u_{12}^2)^2/m^2$, where the latter
can be $\ge 1$. The idea is essentially Gauss’s notion of a reduced binary quadratic form,
Disquisitiones Arithmeticæ (Leipzig: 1801), §171.]


5. Conditions (30) remain invariant; hence $h$ cannot be zero in step S2, when $a$ is
relatively prime to $m$. Since $h$ always decreases in that step, S2 eventually terminates
with $u^2 + v^2 \ge s$. Notice that $pp' \le 0$ throughout the calculation.

The hinted inequality surely holds the first time step S2 is encountered. The
integer $q'$ that minimizes $(h' - q'h)^2 + (p' - q'p)^2$ is $q' = \mathrm{round}\bigl((h'h + p'p)/(h^2 + p^2)\bigr)$,
by Eq. (24). If $(h' - q'h)^2 + (p' - q'p)^2 < h^2 + p^2$ we must have $q' \ne 0$, $q' \ne -1$, hence
$(p' - q'p)^2 \ge p^2$, hence $(h' - q'h)^2 < h^2$, i.e., $|h' - q'h| < h$, i.e., $q'$ is $q$ or $q+1$. We have
$hu + pv \ge h(h' - q'h) + p(p' - q'p) \ge -\frac12(h^2 + p^2)$, so if $u^2 + v^2 < s$ the next iteration of
step S2 will preserve the assumption in the hint. If $u^2 + v^2 \ge s > (u-h)^2 + (v-p)^2$, we
have $2|h(u-h) + p(v-p)| = 2(h(h-u) + p(p-v)) = (u-h)^2 + (v-p)^2 + h^2 + p^2 - (u^2 + v^2) \le (u-h)^2 + (v-p)^2 \le h^2 + p^2$, hence $(u-h)^2 + (v-p)^2$ is minimal by exercise 4. Finally
if both $u^2 + v^2$ and $(u-h)^2 + (v-p)^2$ are $\ge s$, let $u' = h' - q'h$, $v' = p' - q'p$; then
$2|hu' + pv'| \le h^2 + p^2 \le u'^2 + v'^2$, and $h^2 + p^2$ is minimal by exercise 4.

[Generalizations to finding the shortest 2-D vector with respect to other metrics 
are discussed by Kaib and Schnorr, J. Algorithms 21 (1996), 565-578.] 


6. If $u^2 + v^2 \ge s > (u-h)^2 + (v-p)^2$ in the previous answer, we have $(v-p)^2 > v^2$,
hence $(u-h)^2 < u^2$; and if $q = a_j$, so that $h' = a_jh + u$, we must have $a_{j+1} = 1$. It
follows that $\nu_2^2 = \min_{0\le j<t}(m_j^2 + p_{j-1}^2)$, in the notation of exercise 3.3.3-16.

Now we have $m_0 = m_jp_j + m_{j+1}p_{j-1} = a_jm_jp_{j-1} + m_jp_{j-2} + m_{j+1}p_{j-1} < (a_j + 1 + 1/a_j)m_jp_{j-1} \le (A + 1 + 1/A)m_jp_{j-1}$, and $m_j^2 + p_{j-1}^2 \ge 2m_jp_{j-1}$, hence
the result.


7. We shall prove, using condition (19), that $U_j\cdot U_k = 0$ for all $k \ne j$ if and only if
$V_j\cdot V_k = 0$ for all $k \ne j$. Assume that $U_j\cdot U_k = 0$ for all $k \ne j$, and let $U_j = \alpha_1V_1 + \cdots + \alpha_tV_t$. Then $U_j\cdot U_k = \alpha_k$ for all $k$, hence $U_j = \alpha_jV_j$, and $V_j\cdot V_k = \alpha_j^{-1}(U_j\cdot V_k) = 0$
for all $k \ne j$. A symmetric argument proves the converse.


8. Clearly $\nu_{t+1} \le \nu_t$ (a fact used implicitly in Algorithm S, since $s$ is not changed
when $t$ increases). For $t = 2$ this is equivalent to $(\mu_2m/\pi)^{1/2} \ge (3\mu_3m/4\pi)^{1/3}$, i.e.,
$\mu_3 \le \frac43\sqrt{m/\pi}\,\mu_2^{3/2}$. This bound reduces to $\frac43 10^{-4}/\sqrt{\pi}$ with the given parameters, but
for large $m$ and fixed $\mu_2$ the bound (40) is better.


9. Let $f(y_1,\ldots,y_t) = \theta$; then $\gcd(y_1,\ldots,y_t) = 1$, so there is an integer matrix $W$ of
determinant 1 having $(y_1,\ldots,y_t)$ as its first row. (Prove the latter fact by induction
on the magnitude of the smallest nonzero entry in the row.) Now if $X = (x_1,\ldots,x_t)$
is a row vector, we have $XW = X'$ if and only if $X = X'W^{-1}$, and $W^{-1}$ is an integer
matrix of determinant 1, hence the form $g$ defined by $WU$ satisfies $g(x_1,\ldots,x_t) = f(x_1',\ldots,x_t')$; furthermore $g(1,0,\ldots,0) = \theta$.

Without loss of generality, assume that $f = g$. If now $S$ is any orthogonal matrix,
the matrix $US$ defines the same form as $U$, since $(XUS)(XUS)^T = (XU)(XU)^T$.
Choosing $S$ so that its first column is a multiple of $U_1^T$ and its other columns are any
suitable vectors, we have

$$US = \begin{pmatrix}\alpha_1 & 0 & \cdots & 0\\ \alpha_2 & & & \\ \vdots & & U' & \\ \alpha_t & & & \end{pmatrix}$$

for some $\alpha_1, \alpha_2, \ldots, \alpha_t$ and some $(t-1)\times(t-1)$ matrix $U'$. Hence $f(x_1,\ldots,x_t) = (\alpha_1x_1 + \cdots + \alpha_tx_t)^2 + h(x_2,\ldots,x_t)$. It follows that $\alpha_1 = \sqrt\theta$ [in fact, $\alpha_j = (U_1\cdot U_j)/\sqrt\theta$
for $1 \le j \le t$] and that $h$ is a positive definite quadratic form defined by $U'$, where
$\det U' = (\det U)/\sqrt\theta$. By induction on $t$, there are integers $(x_2,\ldots,x_t)$ with

$$h(x_2,\ldots,x_t) \le (\tfrac43)^{(t-2)/2}\,|\det U|^{2/(t-1)}/\theta^{1/(t-1)},$$

and for these integer values we can choose $x_1$ so that $|x_1 + (\alpha_2x_2 + \cdots + \alpha_tx_t)/\alpha_1| \le \frac12$;
equivalently, $(\alpha_1x_1 + \cdots + \alpha_tx_t)^2 \le \frac14\theta$. Hence

$$\theta \le f(x_1,\ldots,x_t) \le \tfrac14\theta + (\tfrac43)^{(t-2)/2}\,|\det U|^{2/(t-1)}/\theta^{1/(t-1)}$$

and the desired inequality follows immediately.

[Note: For $t = 2$ the result is best possible. For general $t$, Hermite’s theorem
implies that $\mu_t \le \pi^{t/2}(4/3)^{t(t-1)/4}/(t/2)!$. A fundamental theorem due to Minkowski
(“Every $t$-dimensional convex set symmetric about the origin with volume $\ge 2^t$ contains
a nonzero integer point”) gives $\mu_t \le 2^t$; this is stronger than Hermite’s theorem for
$t \ge 9$. Even stronger results are known, see (41).]


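10. Since $y_1$ and $y_2$ are relatively prime, we can solve $u_1y_2 - u_2y_1 = m$; furthermore
$(u_1 + qy_1)y_2 - (u_2 + qy_2)y_1 = m$ for all $q$, so we can ensure that $2|u_1y_1 + u_2y_2| \le y_1^2 + y_2^2$ by
choosing an appropriate integer $q$. Now $y_2(u_1 + au_2) \equiv y_2u_1 - y_1u_2 \equiv 0$ (modulo $m$), and
$y_2$ must be relatively prime to $m$, hence $u_1 + au_2 \equiv 0$. Finally let $|u_1y_1 + u_2y_2| = \alpha m$,
$u_1^2 + u_2^2 = \beta m$, $y_1^2 + y_2^2 = \gamma m$; we have $0 \le \alpha \le \frac12\gamma$, and it remains to be shown that
$\alpha \le \frac12\beta$ and $\beta\gamma \ge 1$. The identity $(u_1y_2 - u_2y_1)^2 + (u_1y_1 + u_2y_2)^2 = (u_1^2 + u_2^2)(y_1^2 + y_2^2)$
implies that $1 + \alpha^2 = \beta\gamma$. If $\alpha > \frac12\beta$, we have $2\alpha\gamma > 1 + \alpha^2$, that is, $\gamma - \sqrt{\gamma^2 - 1} < \alpha \le \frac12\gamma$. But $\frac12\gamma < \sqrt{\gamma^2 - 1}$ implies that $\gamma^2 > \frac43$, a contradiction.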


11. Since $a$ is odd, $y_1 + y_2$ must be even. To avoid solutions with $y_1$ and $y_2$ both even,
let $y_1 = x_1 + x_2$, $y_2 = x_1 - x_2$, and solve $x_1^2 + x_2^2 = m/\sqrt3 - \epsilon$, with $x_1 \perp x_2$ and
$x_1$ even; the corresponding multiplier $a$ will be the solution to $(x_2 - x_1)a \equiv x_2 + x_1$
(modulo $2^e$). It is not difficult to prove that $a \equiv 1$ (modulo $2^{k+1}$) if and only if $x_1 \equiv 0$
(modulo $2^k$), so we get the best potency when $x_1 \bmod 4 = 2$. The problem reduces to
finding relatively prime solutions to $x_1^2 + x_2^2 = N$ where $N$ is a large integer of the form
$4k + 1$. By factoring $N$ over the Gaussian integers, we can see that solutions exist if
and only if each prime factor of $N$ (over the usual integers) has the form $4k + 1$.
According to a famous theorem of Fermat, every prime $p$ of the form $4k + 1$ can
be written $p = u^2 + v^2 = (u + iv)(u - iv)$, $v$ even, in a unique way except for the signs
of $u$ and $v$. The numbers $u$ and $v$ can be calculated efficiently by solving $x^2 \equiv -1$
(modulo $p$), then calculating $u + iv = \gcd(x + i, p)$ by Euclid’s algorithm over the
Gaussian integers. [We can take $x = n^{(p-1)/4} \bmod p$ for almost half of all integers $n$.
This application of a Euclidean algorithm is essentially the same as finding the least
nonzero $u^2 + v^2$ such that $u \pm xv \equiv 0$ (modulo $p$). The values of $u$ and $v$ also appear
when Euclid’s algorithm for integers is applied in the ordinary way to $p$ and $x$; see J. A.
Serret and C. Hermite, J. de Math. Pures et Appl. 13 (1848), 12-15.] If the prime
factorization of $N$ is $p_1^{e_1}\ldots p_r^{e_r} = (u_1 + iv_1)^{e_1}(u_1 - iv_1)^{e_1}\ldots(u_r + iv_r)^{e_r}(u_r - iv_r)^{e_r}$, we
get $2^{r-1}$ distinct solutions to $x_1^2 + x_2^2 = N$, $x_1 \perp x_2$, $x_1$ even, by letting $|x_2| + i|x_1| = (u_1 + iv_1)^{e_1}(u_2 \pm iv_2)^{e_2}\ldots(u_r \pm iv_r)^{e_r}$; and all such solutions are obtained in this way.
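The remark about applying the ordinary Euclidean algorithm to $p$ and $x$ can be sketched as follows. This is a minimal illustration, assuming $p$ is a prime of the form $4k+1$; the function name `two_squares` is hypothetical, and the stopping rule (stop at the first remainder not exceeding $\sqrt p$) is the standard form of the Serret and Hermite procedure rather than anything spelled out in the answer.

```python
from math import isqrt

def two_squares(p):
    """Return (u, v), u <= v, with u*u + v*v == p for a prime p = 4k + 1."""
    if p % 4 != 1:
        raise ValueError("p must be congruent to 1 modulo 4")
    # x = n^((p-1)/4) mod p satisfies x^2 = -1 (mod p) whenever n is a
    # quadratic nonresidue, which happens for about half of all n.
    for n in range(2, p):
        x = pow(n, (p - 1) // 4, p)
        if x * x % p == p - 1:
            break
    else:
        raise ValueError("no square root of -1 found; is p prime?")
    # Ordinary Euclidean algorithm on (p, x); stop at the first
    # remainder that does not exceed sqrt(p).
    a, b = p, min(x, p - x)
    while b * b > p:
        a, b = b, a % b
    u, v = b, isqrt(p - b * b)
    if u * u + v * v != p:
        raise ValueError("decomposition failed; is p prime?")
    return tuple(sorted((u, v)))

if __name__ == "__main__":
    print(two_squares(53), two_squares(108934013))
```

Multiplying two such decompositions as Gaussian integers then gives representations of the product, which is how the solutions for composite $N$ are assembled.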

Note: When $m = 10^e$, a similar procedure can be used, but it is five times as
much work since we must keep trying until finding a solution with $x_1 \equiv 0$ (modulo 10).
For example, when $m = 10^{10}$ we have $\lfloor m/\sqrt3\rfloor = 5773502691$, and $5773502689 = 53\cdot108934013 = (7 + 2i)(7 - 2i)(2203 + 10202i)(2203 - 10202i)$. Of the two solutions
$|x_2| + i|x_1| = (7 + 2i)(2203 - 10202i)$ or $(7 + 2i)(2203 + 10202i)$, the former gives
$|x_1| = 67008$ (no good) and the latter gives $|x_1| = 75820$, $|x_2| = 4983$ (which is usable).
Line 9 of Table 1 was obtained by taking $x_1 = 75820$, $x_2 = -4983$.

Line 14 of the table was obtained as follows: $\lfloor2^{32}/\sqrt3\rfloor = 2479700524$; we drop
down to $N = 2479700521$, which equals $37\cdot797\cdot84089$ and has four solutions $N = 4364^2 + 49605^2 = 26364^2 + 42245^2 = 38640^2 + 31411^2 = 11960^2 + 48339^2$. The
corresponding multipliers are 2974037721, 2254986297, 4246248609, and 956772177.
We try also $N - 4$, but it is ineligible because it is divisible by 3. On the other
hand the prime number $N - 8 = 45088^2 + 21137^2$ leads to the multiplier 3825140801.
Similarly, we get additional multipliers from $N - 20$, $N - 44$, $N - 48$, etc. The multiplier
on line 14 is the best of the first sixteen multipliers found by this procedure; it’s one
of the four obtained from $N - 68$.
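For moduli of this size the representations can simply be enumerated, which gives an independent check of the numbers quoted above. The sketch below is a brute-force search, not the factoring method of the answer; `multiplier` solves $(x_2 - x_1)a \equiv x_2 + x_1$ (modulo $2^e$) and relies on Python 3.8 or later for the modular inverse via `pow`. Both function names are illustrative.

```python
from math import gcd, isqrt

def coprime_representations(N):
    """All (x1, x2) with x1 even, x2 >= 0, gcd(x1, x2) = 1, x1^2 + x2^2 = N."""
    found = []
    for x1 in range(0, isqrt(N) + 1, 2):
        rest = N - x1 * x1
        x2 = isqrt(rest)
        if x2 * x2 == rest and gcd(x1, x2) == 1:
            found.append((x1, x2))
    return found

def multiplier(x1, x2, e):
    """Solve (x2 - x1) * a = x2 + x1 (mod 2^e); x2 - x1 must be odd."""
    m = 1 << e
    return (x2 + x1) * pow(x2 - x1, -1, m) % m

if __name__ == "__main__":
    N = 2479700521
    for x1, x2 in coprime_representations(N):
        print(x1, x2, multiplier(x1, x2, 32))
```

Changing the signs of $x_1$ and $x_2$ changes the multiplier, so the enumeration above covers only one sign choice per representation.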

12. $U_j'\cdot U_j' = U_j\cdot U_j + 2\sum_{i\ne j}q_i(U_i\cdot U_j) + \sum_{i\ne j}\sum_{k\ne j}q_iq_k(U_i\cdot U_k)$. The partial
derivative with respect to $q_k$ is twice the left-hand side of (26). If the minimum can be
achieved, these partial derivatives must all vanish.


13. u11 = 1, u21 = irrational, u12 = u22 = 0. 
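14. After three Euclidean steps we find $\nu_2^2 = 5^2 + 5^2$, then S4 produces

$$U = \begin{pmatrix}-5 & 5 & 0\\ -18 & -2 & 0\\ 1 & -2 & 1\end{pmatrix},\qquad V = \begin{pmatrix}-2 & 18 & 38\\ -5 & -5 & -5\\ 0 & 0 & 100\end{pmatrix}.$$

Transformations $(j, q_1, q_2, q_3) = (1,*,0,2)$, $(2,-4,*,1)$, $(3,0,0,*)$, $(1,*,0,0)$ result in

$$U = \begin{pmatrix}-3 & 1 & 2\\ -5 & -8 & -7\\ 1 & -2 & 1\end{pmatrix},\qquad V = \begin{pmatrix}-22 & -2 & 18\\ -5 & -5 & -5\\ 9 & -31 & 29\end{pmatrix},\qquad Z = (0\;\;0\;\;1).$$

Thus $\nu_3 = \sqrt6$, as we already knew from exercise 3.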
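The matrices in this answer can be checked mechanically: every row of $U$ must satisfy the congruence (15), and $U$ and $V$ must stay dual in the sense that $U_i\cdot V_j = m\delta_{ij}$. The sketch below assumes $m = 100$ and $a = 41$; these values are inferred from the rows of $U$ (the first two rows force $a \equiv 41$ modulo 100) and are not stated in the answer itself.

```python
M, A = 100, 41  # assumed parameters, inferred from the rows of U

U_START = [[-5, 5, 0], [-18, -2, 0], [1, -2, 1]]
V_START = [[-2, 18, 38], [-5, -5, -5], [0, 0, 100]]
U_FINAL = [[-3, 1, 2], [-5, -8, -7], [1, -2, 1]]
V_FINAL = [[-22, -2, 18], [-5, -5, -5], [9, -31, 29]]

def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

def is_dual_pair(U, V, m=M, a=A):
    """True if each row u of U has u_1 + a u_2 + a^2 u_3 == 0 (mod m)
    and U_i . V_j equals m when i == j and 0 otherwise."""
    t = len(U)
    rows_ok = all(
        sum(u * pow(a, j, m) for j, u in enumerate(row)) % m == 0
        for row in U)
    dual_ok = all(
        dot(U[i], V[j]) == (m if i == j else 0)
        for i in range(t) for j in range(t))
    return rows_ok and dual_ok

if __name__ == "__main__":
    print(is_dual_pair(U_START, V_START), is_dual_pair(U_FINAL, V_FINAL))
    print(min(dot(row, row) for row in U_FINAL))
```

The shortest row of the final $U$ is $(1,-2,1)$, in agreement with $\nu_3^2 = 6$.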


15. The largest achievable $q$ in (11), minus the smallest achievable, plus 1, is $|u_1| + \cdots + |u_t| - \delta$, where $\delta = 1$ if $u_iu_j < 0$ for some $i$ and $j$, otherwise $\delta = 0$. For example
if $t = 5$, $u_1 > 0$, $u_2 > 0$, $u_3 > 0$, $u_4 = 0$, and $u_5 < 0$, the largest achievable value is
$q = u_1 + u_2 + u_3 - 1$ and the smallest is $q = u_5 + 1 = -|u_5| + 1$.

[Note that the number of hyperplanes is unchanged when $c$ varies, hence the
same answer applies to the problem of covering $L$ instead of $L_0$. However, the stated
formula is not always exact for covering $L_0$, since the hyperplanes that intersect the
unit hypercube may not all contain points of $L_0$. In the example above, we can never
achieve the value $q = u_1 + u_2 + u_3 - 1$ in $L_0$ if $u_1 + u_2 + u_3 > m$; it is achievable if
and only if there is a solution to $m - u_1 - u_2 - u_3 = x_1u_1 + x_2u_2 + x_3u_3 + x_4|u_5|$ in
nonnegative integers $(x_1, x_2, x_3, x_4)$. It may be true that the stated limits are always
achievable when $|u_1| + \cdots + |u_t|$ is minimal, but this does not appear to be obvious.]


16. It suffices to determine all solutions to (15) having minimum $|u_1| + \cdots + |u_t|$,
subtracting 1 if any one of these solutions has components of opposite sign.
Instead of positive definite quadratic forms, we work with the somewhat similar
function $f(x_1,\ldots,x_t) = |x_1U_1 + \cdots + x_tU_t|$, defining $|Y| = |y_1| + \cdots + |y_t|$. Inequality
(21) can be replaced by $|x_k| \le f(y_1,\ldots,y_t)\,(\max_{1\le j\le t}|v_{kj}|)$.

Thus a workable algorithm can be obtained as follows. Replace steps S1 through
S3 by: “Set $U \leftarrow (m)$, $V \leftarrow (1)$, $r \leftarrow 1$, $s \leftarrow m$, $t \leftarrow 1$.” (Here $U$ and $V$ are
$1\times1$ matrices; thus the two-dimensional case will be handled by the general method.
A special procedure for $t = 2$ could, of course, be used; see the reference following
the answer to exercise 5.) In steps S4 and S7, set $s \leftarrow \min(s, |U_k|)$. In step S7, set
$z_k \leftarrow \lfloor\max_{1\le j\le t}|v_{kj}|\,s/m\rfloor$. In step S9, set $s \leftarrow \min(s, |Y| - \delta)$; and in step S10,
output $s = N_t$. Otherwise leave the algorithm as it stands, since it already produces
suitably short vectors. [Math. Comp. 29 (1975), 827-833.]

17. When $k > t$ in S9, and if $Y\cdot Y \le s$, output $Y$ and $-Y$; furthermore if $Y\cdot Y < s$,
take back the previous output of vectors for this $t$. [In the author’s experience preparing
Table 1, there was exactly one vector (and its negative) output for each $\nu_t$, except when
$y_1 = 0$ or $y_t = 0$.]

18. (a) Let $x = m$, $y = (1 - m)/3$, $v_{ij} = y + x\delta_{ij}$, $u_{ij} = -y + \delta_{ij}$. Then $V_j\cdot V_k = -\frac13(m^2 - 1)$ for $j \ne k$, $V_k\cdot V_k = \frac23(m^2 + \frac12)$, $U_j\cdot U_j = \frac13(m^2 + 2)$, $z_k \approx \sqrt{2/9}\,m$. (This
example satisfies (28) with $a = 1$ and works for all $m \equiv 1$ (modulo 3).)

(b) Interchange the roles of $U$ and $V$ in step S5. Also set $s \leftarrow \min(s, U_i\cdot U_i)$ for
all $U_i$ that change. For example, when $m = 64$ this transformation with $j = 1$, applied
to the matrices of (a), reduces

$$V = \begin{pmatrix}43 & -21 & -21\\ -21 & 43 & -21\\ -21 & -21 & 43\end{pmatrix},\qquad U = \begin{pmatrix}22 & 21 & 21\\ 21 & 22 & 21\\ 21 & 21 & 22\end{pmatrix}$$

to

$$V = \begin{pmatrix}1 & 1 & 1\\ -21 & 43 & -21\\ -21 & -21 & 43\end{pmatrix},\qquad U = \begin{pmatrix}22 & 21 & 21\\ -1 & 1 & 0\\ -1 & 0 & 1\end{pmatrix}.$$

[Since the transformation can increase the length of $V_j$, an algorithm that incorporates
both transformations must be careful to avoid infinite looping. See also exercise 23.]


19. No, since a product of non-identity matrices with all off-diagonal elements nonnegative
and all diagonal elements 1 cannot be the identity.

[However, looping would be possible if a subsequent transformation with $q = -1$
were performed when $-2V_i\cdot V_j = V_j\cdot V_j$; the rounding rule must be asymmetric with
respect to sign if non-shortening transformations are allowed.]


20. When $a \bmod 8 = 5$, the points $2^{-e}(x, s(x), \ldots, s^{[t-1]}(x))$ for $x$ in the period are
the same as the points $2^{2-e}(y, \sigma(y), \ldots, \sigma^{[t-1]}(y))$ for $0 \le y < 2^{e-2}$, plus $2^{-e}(t, \ldots, t)$,
where $\sigma(y) = (ay + \lfloor a/4\rfloor t) \bmod 2^{e-2}$ and $t = X_0 \bmod 4$. So in this case we should use
Algorithm S with $m = 2^{e-2}$.

When $a \bmod 8 = 3$, the maximum distance between parallel hyperplanes that
cover the points $2^{-e}(x, s(x), \ldots, s^{[t-1]}(x))$ modulo 1 is the same as the maximum
distance covering the points $2^{-e}(x, -s(x), \ldots, (-1)^{t-1}s^{[t-1]}(x))$, because the negation
of coordinates doesn’t change distance. The latter points are $2^{2-e}(y, \sigma(y), \ldots, \sigma^{[t-1]}(y))$
where $\sigma(y) = (-ay - \lceil a/4\rceil t) \bmod 2^{e-2}$, plus a constant offset. Again we apply
Algorithm S with $m = 2^{e-2}$; changing $a$ to $m - a$ has no effect on the result.


21. $X_{4n+4} \equiv X_{4n}$ (modulo 4), so it is now appropriate to let $V_1 = (4, 4a^2, 4a^3)/m$,
$V_2 = (0,1,0)$, $V_3 = (0,0,1)$ define the corresponding lattice $L_0$.


582 ANSWERS TO EXERCISES 3.3.4 


24. Let $m = p$; an analysis paralleling the text can be given. For example, when
$t = 4$ we have $X_{n+3} = ((a^2 + b)X_{n+1} + abX_n) \bmod m$, and we want to minimize
$u_1^2 + u_2^2 + u_3^2 + u_4^2 \ne 0$ such that $u_1 + bu_3 + abu_4 \equiv u_2 + au_3 + (a^2 + b)u_4 \equiv 0$
(modulo $m$).

Replace steps S1 through S3 by the operations of setting

$$U \leftarrow \begin{pmatrix}m & 0\\ 0 & m\end{pmatrix},\quad V \leftarrow \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix},\quad R \leftarrow \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix},\quad s \leftarrow m^2,\quad t \leftarrow 2,$$

and outputting $\nu_2 = m$. Replace step S4 by

S4′. [Advance t.] If $t = T$, the algorithm terminates. Otherwise set $t \leftarrow t + 1$
and $R \leftarrow R\begin{pmatrix}0 & b\\ 1 & a\end{pmatrix} \bmod m$. Set $U_t$ to the new row $(-r_{12}, -r_{22}, 0, \ldots, 0, 1)$ of $t$
elements, and set $u_{it} \leftarrow 0$ for $1 \le i < t$. Set $V_t$ to the new row $(0, \ldots, 0, m)$.
For $1 \le i < t$, set $q \leftarrow \mathrm{round}((v_{i1}r_{12} + v_{i2}r_{22})/m)$, $v_{it} \leftarrow v_{i1}r_{12} + v_{i2}r_{22} - qm$,
and $U_t \leftarrow U_t + qU_i$. Finally set $s \leftarrow \min(s, U_t\cdot U_t)$, $k \leftarrow t$, $j \leftarrow 1$.

[A similar generalization applies to all sequences of length $p^k - 1$ that satisfy the linear
recurrence 3.2.2-(8). Additional numerical examples have been given by A. Grube,
Zeitschrift für angewandte Math. und Mechanik 53 (1973), T223-T225; L’Ecuyer,
Blouin, and Couture, ACM Trans. Modeling and Comp. Simul. 3 (1993), 87-98.]


25. The given sum is at most twice the quantity J o<p<m/(2a) (dk) = 1 + + f(m/d), 
where T 


1 
f(m) = a 5 csce(rk/m) 
1<k<m/2 
m/2 m/2 
== | cse(ar/m) de + 0(—) = + Intan ( u x) H o(-). 
m Jy m T 2m /|i m 


[When d = 1, we have X o<crem Mk) = (2/7) Inm + 1 + (2/7) n(2e/m) + O(1/m).] 
26. If $\gcd(q,m) = d$, the same derivation goes through with $m$ replaced by $m/d$.
Suppose we have $m = p_1^{e_1}\ldots p_r^{e_r}$ and $\gcd(a - 1, m) = p_1^{f_1}\ldots p_r^{f_r}$ and $d = p_1^{d_1}\ldots p_r^{d_r}$. If
$m$ is replaced by $m/d$, then $s$ is replaced by $p_1^{\max(0,e_1-f_1-d_1)}\ldots p_r^{\max(0,e_r-f_r-d_r)}$. If
$m/d > 1$, we can also replace $N$ by $N \bmod (m/d)$.


27. It is convenient to use the following functions: $\rho(x) = 1$ if $x = 0$, $\rho(x) = x$ if
$0 < x \le m/2$, $\rho(x) = m - x$ if $m/2 < x < m$; $\mathrm{trunc}(x) = \lfloor x/2\rfloor$ if $0 \le x \le m/2$,
$\mathrm{trunc}(x) = m - \lfloor(m - x)/2\rfloor$ if $m/2 < x < m$; $L(x) = 0$ if $x = 0$, $L(x) = \lfloor\lg x\rfloor + 1$ if
$0 < x \le m/2$, $L(x) = -(\lfloor\lg(m - x)\rfloor + 1)$ if $m/2 < x < m$; and $l(x) = \max(1, 2^{|x|-1})$.
Note that $l(L(x)) \le \rho(x) < 2l(L(x))$ and $2\rho(x) \le 1/r(x) = m\sin(\pi x/m) < \pi\rho(x)$, for
$0 < x < m$.

Say that a vector $(u_1,\ldots,u_t)$ is bad if it is nonzero and satisfies (15); and let $\rho_{\min}$ be
the minimum value of $\rho(u_1)\ldots\rho(u_t)$ over all bad $(u_1,\ldots,u_t)$. The vector $(u_1,\ldots,u_t)$ is
said to be in class $(L(u_1),\ldots,L(u_t))$. Thus there are at most $(2\lg m + 1)^t$ classes, and
class $(L_1,\ldots,L_t)$ contains at most $l(L_1)\ldots l(L_t)$ vectors. Our proof is based on showing
that the bad vectors in each fixed class contribute at most $2/\rho_{\min}$ to $\sum r(u_1,\ldots,u_t)$;
this establishes the desired bound, since $1/\rho_{\min} < \pi^tr_{\max}$.

Let $\mu = \lfloor\lg\rho_{\min}\rfloor$. The $\mu$-fold truncation operator on a vector is defined to be
the following operation repeated $\mu$ times: “Let $j$ be minimal such that $\rho(u_j) > 1$,
and replace $u_j$ by $\mathrm{trunc}(u_j)$; but do nothing if $\rho(u_j) = 1$ for all $j$.” (This operation
essentially throws away one bit of information about $(u_1,\ldots,u_t)$.) If $(u_1',\ldots,u_t')$ and
$(u_1'',\ldots,u_t'')$ are two vectors of the same class having the same $\mu$-fold truncation, we say




they are similar; in this case it follows that $\rho(u_1' - u_1'')\ldots\rho(u_t' - u_t'') < 2^\mu \le \rho_{\min}$. For
example, any two vectors of the form $((1x_2x_1)_2, 0, m - (1x_3)_2, (101x_5x_4)_2, (1101)_2)$ are
similar when $m$ is large and $\mu = 5$; the $\mu$-fold truncation operator successively removes
$x_1, x_2, x_3, x_4, x_5$. Since the difference of two bad vectors satisfies (15), it is impossible
for two unequal bad vectors to be similar. Therefore class $(L_1,\ldots,L_t)$ can contain
at most $\max(1, l(L_1)\ldots l(L_t)/2^\mu)$ bad vectors. If class $(L_1,\ldots,L_t)$ contains exactly
one bad vector $(u_1,\ldots,u_t)$, we have $r(u_1,\ldots,u_t) \le r_{\max} \le 1/\rho_{\min}$; if it contains
$\le l(L_1)\ldots l(L_t)/2^\mu$ bad vectors, each of them has $r(u_1,\ldots,u_t) \le 1/\rho(u_1)\ldots\rho(u_t) \le 1/l(L_1)\ldots l(L_t)$, and we have $1/2^\mu < 2/\rho_{\min}$.


28. Let ¢ = e7/(™—) and let Sk = Dosjen eis The analog of (51) is 
|Sxo| = Vm, hence the analog of (53) is 


i- 5 wer 


O0<n<N 


= O((Vmlogm)/N). 


The analogous theorem now states that 


t+1 
DO = o( ten ) + O((logm)'rmax),  _D®_, = O((log m)" rmax). 
In fact, DË; < m E r(u1,...,ue) [summed over nonzero solutions of (15)] + 
= >or(us,..-, ue) [summed over all nonzero (u1, .. ., u+)]. The latter sum is O(log m)‘ 
by exercise 25 with d = 1, and the former sum is treated as in exercise 27. 

Let us now consider the quantity $R(a) = \sum r(u_1,\ldots,u_t)$ summed over nonzero
solutions of (15). Since $m$ is prime, each $(u_1,\ldots,u_t)$ can be a solution to (15) for at
most $t - 1$ values of $a$, hence $\sum_{0<a<m}R(a) \le (t-1)\sum r(u_1,\ldots,u_t) = O(t(\log m)^t)$.
It follows that the average value of $R(a)$ taken over all $\varphi(m-1)$ primitive roots is
$O(t(\log m)^t/\varphi(m-1))$.

Note: In general $1/\varphi(n) = O(\log\log n/n)$; we have therefore proved that for all
prime $m$ and for all $T$ there exists a primitive root $a$ modulo $m$ such that the linear
congruential sequence $(1, a, 0, m)$ has discrepancy $D^{(t)}_{m-1} = O(m^{-1}T(\log m)^T\log\log m)$
for $1 \le t \le T$. This method of proof does not extend to a similar result for linear
congruential generators of period $2^e$ modulo $2^e$, since for example the vector $(1,-3,3,-1)$
solves (15) for about $2^{2e/3}$ values of $a$.


29. To get an upper bound, allow the nonzero components of $u = (u_1,\ldots,u_t)$ to be
any real values $1 \le |u_j| \le \frac12m$. If $k$ components are nonzero, we have $r(u) \le 1/(2^k\rho(u))$
in the notation of the answer to exercise 27. And if $u_1^2 + \cdots + u_t^2$ has a given value $\nu^2$,
we minimize $\rho(u)$ by taking $u_1 = \cdots = u_{k-1} = 1$ and $u_k^2 = \nu^2 - k + 1$. Thus
$r(u) \le 1/(2^k\sqrt{\nu^2 - k + 1})$. But $2^k\sqrt{\nu^2 - k + 1} \ge \sqrt8\,\nu$, since $\nu \ge k \ge 2$.

30. Let’s first minimize $q|aq - mp|$ for $1 \le q < m$ and $0 \le p < a$. In the notation
of exercise 4.5.3-42, we have $aq_n - mp_n = (-1)^nK_{s-n-1}(a_{n+2},\ldots,a_s)$ for $0 \le n \le s$.
In the range $q_{n-1} \le q < q_n$ we have $|aq - mp| \ge |aq_{n-1} - mp_{n-1}|$; consequently
$q|aq - mp| \ge q_{n-1}|aq_{n-1} - mp_{n-1}|$, and the minimum is $\min_{0\le n<s}q_n|aq_n - mp_n| = \min_{0\le n<s}K_n(a_1,\ldots,a_n)K_{s-n-1}(a_{n+2},\ldots,a_s)$. By exercise 4.5.3-32 we have $m = K_n(a_1,\ldots,a_n)\,a_{n+1}K_{s-n-1}(a_{n+2},\ldots,a_s) + K_n(a_1,\ldots,a_n)K_{s-n-2}(a_{n+3},\ldots,a_s) + K_{n-1}(a_1,\ldots,a_{n-1})K_{s-n-1}(a_{n+2},\ldots,a_s)$; and our problem is essentially that of maximizing
the quantity $m/K_n(a_1,\ldots,a_n)K_{s-n-1}(a_{n+2},\ldots,a_s)$, which lies between $a_{n+1}$
and $a_{n+1} + 2$.

Now let $A = \max(a_1,\ldots,a_s)$. Since $r(m - u) = r(u)$, we can assume that
$r_{\max} = r(u)r(au \bmod m)$ for some $u$ with $1 \le u \le \frac12m$. Setting $u' = \min(au \bmod m, (-au) \bmod m)$, we have $r_{\max} = r(u)r(u')$. We know from the previous paragraph that
$uu' \ge qq'$, where $A/m \le 1/qq' \le (A+2)/m$. Furthermore $2u \le r(u)^{-1} \le \pi u$ for
$0 < u \le \frac12m$, so $r_{\max} \le 1/(4uu')$. Hence we have $r_{\max} \le (A + 2)/(4m)$. (There is a
similar lower bound, namely $r_{\max} > A/(\pi^2m)$.)


31. Equivalently, the conjecture is that all large $m$ can be written $m = K_n(a_1,\ldots,a_n)$
for some $n$ and some $a_i \in \{1,2,3\}$. For fixed $n$ the $3^n$ numbers $K_n(a_1,\ldots,a_n)$ have an
average value of order $(1 + \sqrt2)^n$, and their standard deviation is of order $(2.51527)^n$; so
the conjecture is almost surely true. S. K. Zaremba conjectured in 1972 that all $m$ can
be represented with $a_i \le 5$; T. W. Cusick made some early progress on this problem
in Mathematika 24 (1977), 166-172, and an excellent survey of later work has been
prepared by A. Kontorovich in Bull. Amer. Math. Soc. 50 (2013), 187-228. It appears
that only the cases $m = 54$ and $m = 150$ require $a_i = 5$, and the largest $m$’s that require
4s are 2052, 2370, 5052, and 6234; at least, the author has found representations with
$a_i \le 3$ for all other integers less than 2000000. When we require $a_i \le 2$, the average of
$K_n(a_1,\ldots,a_n)$ is $\frac45 2^n + \frac15(-2)^{-n}$, while the standard deviation grows as $(2.04033)^n$.
The density of such numbers in the author’s experiments (which considered $2^6$ blocks
of $2^{14}$ numbers each, for $m \le 2^{20}$) appears to vary between .50 and .65.

[See I. Borosh and H. Niederreiter, BIT 23 (1983), 65-74, for a computational
method that finds multipliers with small partial quotients. They have found 2-bounded
solutions with $m = 2^e$ for $25 \le e \le 35$.]
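Small cases of these claims can be explored by generating continuants directly. The sketch below is an illustrative search, assuming the recurrence $K_{n+1} = a_{n+1}K_n + K_{n-1}$ with $K_{-1} = 0$, $K_0 = 1$, and allowing every partial quotient (including the last) to be as small as 1, as in the statement of the conjecture above. The function name and the limit of 200 are choices made here, not taken from the text.

```python
def continuant_values(max_quotient, limit):
    """Set of all m <= limit of the form K_n(a_1, ..., a_n)
    with every a_i in 1..max_quotient."""
    start = (0, 1)                 # (K_{n-1}, K_n) for the empty continuant
    seen = {start}
    stack = [start]
    values = {1}
    while stack:
        prev, cur = stack.pop()
        for a in range(1, max_quotient + 1):
            nxt = a * cur + prev   # append one more partial quotient
            if nxt > limit:
                break
            state = (cur, nxt)
            if state not in seen:
                seen.add(state)
                stack.append(state)
                values.add(nxt)
    return values

if __name__ == "__main__":
    limit = 200
    for bound in (3, 4, 5):
        reachable = continuant_values(bound, limit)
        print(bound, [m for m in range(1, limit + 1) if m not in reachable])
```

The state space consists of pairs of consecutive continuants, so the search stays small for modest limits; it is not intended for the ranges explored in the author's experiments.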


32. (a) $U_n - Z_n/m_1 \equiv (m_2 - m_1)Y_n/m_1m_2$ (modulo 1), and $(m_1 - m_2)/m_1m_2 \approx 2^{-54}$.
(Therefore we can analyze the high-order bits of $Z_n$ by analyzing $U_n$. The low-order
bits are probably random too, but this argument does not apply to them.) (b) We have
$U_n = W_n/m$ for all $n$. The Chinese remainder theorem tells us that we need only verify
the congruences $W_n \equiv X_nm_2$ (modulo $m_1$) and $W_n \equiv -Y_nm_1$ (modulo $m_2$), because
$m_1 \perp m_2$. [Pierre L’Ecuyer and Shu Tezuka, Math. Comp. 57 (1991), 735-746.]


SECTION 3.4.1 
1. $\alpha + (\beta - \alpha)U$.


2. Let $U = X/m$; then $\lfloor kU\rfloor = r \iff r \le kX/m < r + 1 \iff mr/k \le X < m(r+1)/k \iff \lceil mr/k\rceil \le X < \lceil m(r+1)/k\rceil$. The exact probability is given by the
formula $(1/m)(\lceil m(r+1)/k\rceil - \lceil mr/k\rceil) = 1/k + \epsilon$, where $|\epsilon| < 1/m$.
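The exact-probability formula can be compared with a direct count for small parameters. This is a minimal sketch in exact integer and rational arithmetic; the function names are illustrative.

```python
from fractions import Fraction

def ceil_div(a, b):
    """Ceiling of a/b for integers a >= 0, b > 0."""
    return -(-a // b)

def exact_probability(m, k, r):
    """(ceil(m(r+1)/k) - ceil(mr/k)) / m, the formula of the answer."""
    return Fraction(ceil_div(m * (r + 1), k) - ceil_div(m * r, k), m)

def counted_probability(m, k, r):
    """Fraction of X in 0..m-1 for which floor(k * X / m) equals r."""
    hits = sum(1 for X in range(m) if (k * X) // m == r)
    return Fraction(hits, m)

if __name__ == "__main__":
    m, k = 100, 7
    for r in range(k):
        print(r, exact_probability(m, k, r), counted_probability(m, k, r))
```

Each probability differs from $1/k$ by less than $1/m$, because the two ceilings differ from $m(r+1)/k$ and $mr/k$ by less than 1 each.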


3. If full-word random numbers are given, the result will deviate from the correct
distribution by at most $1/m$, as in exercise 2; but all of the excess is given to the smallest
results. Thus if $k \approx 2m/3$, the result will be less than $k/2$ about $\frac23$ of the time. It is
much better to obtain a perfectly uniform distribution by rejecting $U$ if $U \ge k\lfloor m/k\rfloor$;
see D. E. Knuth, The Stanford GraphBase (New York: ACM Press, 1994), 221.
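The bias of the remainder method and the rejection remedy can both be sketched briefly. The code below assumes the random source delivers integers $X$ uniform on $0 \le X < m = 2^{\text{bits}}$; `fraction_below_half` and `uniform_below` are hypothetical names, and Python's `random.getrandbits` stands in for the full-word generator.

```python
import random
from fractions import Fraction

def fraction_below_half(m, k):
    """Exact probability that X mod k < k/2 when X is uniform on 0..m-1."""
    hits = sum(1 for X in range(m) if 2 * (X % k) < k)
    return Fraction(hits, m)

def uniform_below(k, bits=32, rng=random):
    """Uniform integer in 0..k-1: reject X >= k * floor(m / k), m = 2^bits."""
    m = 1 << bits
    limit = k * (m // k)
    while True:
        X = rng.getrandbits(bits)
        if X < limit:
            return X % k

if __name__ == "__main__":
    print(fraction_below_half(3000, 2000))   # bias when k is about 2m/3
    rng = random.Random(12345)
    print([uniform_below(6, 8, rng) for _ in range(10)])
```

Because `limit` is a multiple of $k$, each residue is hit by exactly $\lfloor m/k\rfloor$ accepted values, and fewer than half of all draws are ever rejected.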

On the other hand, if a linear congruential sequence is used, k must be relatively 
prime to the modulus m, lest the numbers have a very short period, by the results 
of Section 3.2.1.1. For example, if k = 2 and m is even, the numbers will at best be 
alternately 0 and 1. The method is slower than (1) in nearly every case, so it is not 
recommended. 

Unfortunately, however, the “himult” operation in (1) is not supported in many 
high-level languages; see exercise 3.2.1.1-3. Division by m/k may be best when himult 
is unavailable. 
Fig. A-4. Region of “acceptance” for the algorithm of exercise 6.

4. $\max(X_1, X_2) \le x$ if and only if $X_1 \le x$ and $X_2 \le x$; $\min(X_1, X_2) \ge x$ if and only
if $X_1 \ge x$ and $X_2 \ge x$. The probability that two independent events both happen is
the product of the individual probabilities.


5. Obtain independent uniform deviates $U_1$ and $U_2$. Set $X \leftarrow U_2$. If $U_1 \ge p$,
set $X \leftarrow \max(X, U_3)$, where $U_3$ is a third uniform deviate. If $U_1 \ge p + q$, also set
$X \leftarrow \max(X, U_4)$, where $U_4$ is a fourth uniform deviate. This method can obviously
be generalized to any polynomial, and indeed even to infinite power series (as shown
for example in Algorithm S, which uses minimization instead of maximization).

We could also proceed as follows (suggested by M. D. MacLaren): If $U_1 < p$, set
$X \leftarrow U_1/p$; otherwise if $U_1 < p + q$, set $X \leftarrow \max((U_1 - p)/q, U_2)$; otherwise set
$X \leftarrow \max((U_1 - p - q)/r, U_2, U_3)$. This method requires less time than the other to
obtain the uniform deviates, although it involves further arithmetical operations and
it is slightly less stable numerically.
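Both procedures can be sketched in a few lines. The code assumes the target distribution is $F(x) = px + qx^2 + rx^3$ on $0 \le x \le 1$ with $p + q + r = 1$ and nonnegative coefficients; the function names are illustrative, and Python's `random.Random` stands in for the source of uniform deviates.

```python
import random

def poly_deviate(p, q, rng):
    """First method: a maximum of 1, 2, or 3 uniform deviates."""
    u1 = rng.random()
    x = rng.random()
    if u1 >= p:
        x = max(x, rng.random())
    if u1 >= p + q:
        x = max(x, rng.random())
    return x

def poly_deviate_maclaren(p, q, rng):
    """Second method: reuse the selecting deviate U1 after rescaling it."""
    r = 1.0 - p - q
    u1 = rng.random()
    if u1 < p:
        return u1 / p
    if u1 < p + q:
        return max((u1 - p) / q, rng.random())
    return max((u1 - p - q) / r, rng.random(), rng.random())

if __name__ == "__main__":
    rng = random.Random(2024)
    p, q = 0.2, 0.3
    n = 100000
    for method in (poly_deviate, poly_deviate_maclaren):
        below = sum(method(p, q, rng) <= 0.5 for _ in range(n)) / n
        print(method.__name__, below)   # compare with F(0.5) = 0.2375
```

The second method consumes at most three uniform deviates instead of four, at the price of a division in every branch.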


6. $F(x) = A_1/(A_1 + A_2)$, where $A_1$ and $A_2$ are the areas in Fig. A-4; so

$$F(x) = \frac{\int_0^x\sqrt{1 - y^2}\,dy}{\int_0^1\sqrt{1 - y^2}\,dy} = \frac2\pi\arcsin x + \frac2\pi x\sqrt{1 - x^2}.$$

The probability of termination at step 2 is $p = \pi/4$, each time step 2 is encountered, so
the number of executions of step 2 has the geometric distribution. The characteristics
of this number are (min 1, ave $4/\pi$, max $\infty$, dev $(4/\pi)\sqrt{1 - \pi/4}$), by exercise 17.


7. If $k = 1$, then $n_1 = n$ and the problem is trivial. Otherwise it is always possible
to find $i \ne j$ such that $n_i \le n \le n_j$. Fill $B_i$ with $n_i$ cubes of color $C_i$ and $n - n_i$ of
color $C_j$, then decrease $n_j$ by $n - n_i$ and eliminate color $C_i$. We are left with the same
sort of problem but with $k$ reduced by 1; by induction, it’s possible.

The following algorithm can be used to compute the $P$ and $Y$ tables: Form a list of
pairs $(p_1, 1)\ldots(p_k, k)$ and sort it by first components, obtaining a list $(q_1, a_1)\ldots(q_k, a_k)$
where $q_1 \le \cdots \le q_k$. Set $n \leftarrow k$; then repeat the following operations until $n = 0$: Set
$P[a_1 - 1] \leftarrow kq_1$ and $Y[a_1 - 1] \leftarrow x_{a_n}$. Delete $(q_1, a_1)$ and $(q_n, a_n)$, then insert the new
entry $(q_n - (1/k - q_1), a_n)$ into its proper place in the list and decrease $n$ by 1.
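The table-building procedure can be sketched as follows. This is a minimal illustration in exact rational arithmetic; it uses 0-based indices, stores the index of the alias value rather than the value $x_j$ itself, and treats the final single-element step explicitly. The function names are hypothetical.

```python
from bisect import insort
from fractions import Fraction

def alias_tables(probs):
    """Build tables P and Y: slot j yields outcome j with probability P[j]
    and outcome Y[j] otherwise (each slot is chosen with probability 1/k)."""
    k = len(probs)
    pairs = sorted((Fraction(p), j) for j, p in enumerate(probs))
    P, Y = [None] * k, [None] * k
    while pairs:
        q1, a1 = pairs.pop(0)           # poorest remaining element
        if not pairs:                   # last element: q1 equals 1/k
            P[a1], Y[a1] = k * q1, a1
            break
        qn, an = pairs.pop()            # richest remaining element
        P[a1], Y[a1] = k * q1, an
        # The richest element gives up 1/k - q1 and goes back in order.
        insort(pairs, (qn - (Fraction(1, k) - q1), an))
    return P, Y

def implied_distribution(P, Y):
    """Outcome probabilities realized by the tables; should equal the input."""
    k = len(P)
    out = [Fraction(0)] * k
    for j in range(k):
        out[j] += P[j] / k
        out[Y[j]] += (1 - P[j]) / k
    return out

if __name__ == "__main__":
    probs = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
    P, Y = alias_tables(probs)
    print(P, Y, implied_distribution(P, Y) == probs)
```

Keeping the list sorted with `insort` costs $O(k)$ per step here; a priority structure would be the natural replacement for large $k$.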

(If $p_j < 1/k$ the algorithm will never put $x_j$ in the $Y$ table; this fact is used
implicitly in Algorithm M. The algorithm attempts to maximize the probability that
$V < P_K$ in (3), by always robbing from the richest remaining element and giving it to
the poorest. However, it is very difficult to determine the absolute maximum of this
probability, since such a task is at least as difficult as the “bin-packing problem”; see
Section 7.9.)
8. Replace $P_j$ by $(j + P_j)/k$ for $0 \le j < k$.
9. Consider the sign of $f''(x) = \sqrt{2/\pi}\,(x^2 - 1)e^{-x^2/2}$.
10. Let $S_j = (j-1)/5$ for $1 \le j \le 16$ and $p_{j+15} = F(S_{j+1}) - F(S_j) - p_j$ for $1 \le j \le 15$;
also let $p_{31} = 1 - F(3)$ and $p_{32} = 0$. (Eq. (15) defines $p_1, \ldots, p_{15}$.) The algorithm
of exercise 7 can now be used with $k = 32$ to compute $P_j$ and $Y_j$, after which we will
have $1 \le Y_j \le 15$ for $1 \le j \le 32$. Set $P_0 \leftarrow P_{32}$ (which is 0) and $Y_0 \leftarrow Y_{32}$. Then set
$Z_j \leftarrow 1/(5 - 5P_j)$ and $Y_j \leftarrow \frac15Y_j - Z_j$ for $0 \le j < 32$; $Q_j \leftarrow 1/(5P_j)$ for $1 \le j \le 15$.
Let h = } and fi41s(x) = y2/nr(e7?"? — e7350) piss for Sj < £ < Sj +h. 
Then let aj = fj415(S;) for 1 < j < 5, bj = fj+ıs(9;) for 6 < j < 15; also bj = 
—hfisis(Sj) + h) for 1 < j < 5, and aj = fj4i5(xj) + (£; — Sj)bj/h for 6 < j < 15, 
where zj is the root of the equation fj+15(£;) = —b;/h. Finally set Dj415 < aj/bj for 
1<j<15and Ej+ıs + 25/j for 1 < j <5, Ej+1s + 1/(e-Y/° 1) for 6 < j < 15. 
Table 1 was computed while making use of the following intermediate values:
$(p_1, \ldots, p_{31})$ = (.156, .147, .133, .116, .097, .078, .060, .044, .032, .022, .014, .009, .005,
.003, .002, .002, .005, .007, .009, .010, .009, .009, .008, .006, .005, .004, .002, .002, .001, .001,
.003); $(x_6, \ldots, x_{15})$ = (1.115, 1.304, 1.502, 1.700, 1.899, 2.099, 2.298, 2.497, 2.697, 2.896);
$(a_1, \ldots, a_{15})$ = (7.5, 9.1, 9.5, 9.8, 9.9, 10.0, 10.0, 10.1, 10.1, 10.1, 10.1, 10.2, 10.2, 10.2, 10.2);
$(b_1, \ldots, b_{15})$ = (14.9, 11.7, 10.9, 10.4, 10.1, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.7, 10.8,
10.9).
11. Let $g(t) = e^{9/2}te^{-t^2/2}$ for $t \ge 3$. Since $G(x) = \int_3^xg(t)\,dt = 1 - e^{-(x^2-9)/2}$, a random
variable $X$ with density $g$ can be computed by setting $X \leftarrow G^{[-1]}(1 - V) = \sqrt{9 - 2\ln V}$.
Now $e^{-t^2/2} \le (t/3)e^{-t^2/2}$ for $t \ge 3$, so we obtain a valid rejection method if we accept
$X$ with probability $f(X)/cg(X) = 3/X$.
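The rejection step translates directly into code. The sketch below assumes the goal is the tail $x \ge 3$ of the standard normal distribution; the name `normal_tail` is illustrative, and `1 - rng.random()` is used so that the logarithm never receives zero.

```python
import math
import random

def normal_tail(rng):
    """Standard normal deviate conditioned on being >= 3."""
    while True:
        v = 1.0 - rng.random()                 # uniform on (0, 1]
        x = math.sqrt(9.0 - 2.0 * math.log(v))
        if rng.random() * x <= 3.0:            # accept with probability 3/x
            return x

if __name__ == "__main__":
    rng = random.Random(99)
    xs = [normal_tail(rng) for _ in range(50000)]
    # Conditional mean of the tail: phi(3) / (1 - Phi(3)).
    phi3 = math.exp(-4.5) / math.sqrt(2.0 * math.pi)
    tail = 0.5 * math.erfc(3.0 / math.sqrt(2.0))
    print(sum(xs) / len(xs), phi3 / tail)
```

Since $X \ge 3$ always holds, the acceptance probability $3/X$ is close to 1 and few candidates are discarded.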


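12. We have $f'(x) = xf(x) - 1 < 0$ for $x \ge 0$, since $f(x) = x^{-1} - e^{x^2/2}\int_x^\infty e^{-t^2/2}\,dt/t^2$
for $x > 0$. Let $x = a_{j-1}$ and $y^2 = x^2 + 2\ln2$; then

$$\sqrt{2/\pi}\int_y^\infty e^{-t^2/2}\,dt = \tfrac12\sqrt{2/\pi}\,e^{-x^2/2}f(y) < \tfrac12\sqrt{2/\pi}\,e^{-x^2/2}f(x) = 2^{-j},$$

hence $y > a_j$.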
13. Take $b_j = \mu_j$; consider now the problem with $\mu_j = 0$ for each j. In matrix
notation, if $Y = AX$, where $A = (a_{ij})$, we need $AA^T = C = (c_{ij})$. (In other notation, if
$Y_j = \sum_k a_{jk}X_k$, then the average value of $Y_iY_j$ is $\sum_k a_{ik}a_{jk}$.) If this matrix equation can
be solved for A, it can be solved when A is triangular, since $A = BU$ for some orthogonal
matrix U and some triangular B, and $BB^T = C$. The desired triangular solution can be
obtained by solving the equations $a_{11}^2 = c_{11}$, $a_{11}a_{21} = c_{12}$, $a_{21}^2 + a_{22}^2 = c_{22}$, $a_{11}a_{31} = c_{13}$,
$a_{21}a_{31} + a_{22}a_{32} = c_{23}$, ..., successively for $a_{11}$, $a_{21}$, $a_{22}$, $a_{31}$, $a_{32}$, etc. [Note: The
covariance matrix must be positive semidefinite, since the average value of $(\sum_j y_jY_j)^2$
is $\sum_{i,j} c_{ij}y_iy_j$, which must be nonnegative. And there is always a solution when C is
positive semidefinite, since $C = U^{-1}\,\mathrm{diag}(\lambda_1,\ldots,\lambda_n)\,U$, where the eigenvalues $\lambda_j$ are
nonnegative, and $U^{-1}\,\mathrm{diag}(\sqrt{\lambda_1},\ldots,\sqrt{\lambda_n})\,U$ is a solution.]
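
The successive solution of these equations is the familiar Cholesky-style factorization. A minimal Python sketch follows; the name `cholesky_lower` is hypothetical, and the code assumes C is symmetric and positive definite (a semidefinite C with a zero pivot would need extra care, which is not shown).

```python
import math


def cholesky_lower(c):
    """Return lower-triangular A with A A^T = C, solving row by row."""
    n = len(c)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # c[i][j] = sum_k a[i][k] * a[j][k]; solve for the one unknown a[i][j].
            s = c[i][j] - sum(a[i][k] * a[j][k] for k in range(j))
            if i == j:
                a[i][i] = math.sqrt(s)      # diagonal: a_ii^2 = remainder
            else:
                a[i][j] = s / a[j][j]       # off-diagonal: divide by known a_jj
    return a
```

Given independent normal deviates $X_1, \ldots, X_n$, the dependent deviates are then $Y_j = \mu_j + \sum_k a_{jk}X_k$.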
14. $F(x/c)$ if $c > 0$; the step function $[x \ge 0]$ if $c = 0$; or $1 - F(x/c)$ if $c < 0$.
15. Distribution $\int_{-\infty}^{\infty} F_1(x-t)\,dF_2(t)$. Density $\int_{-\infty}^{\infty} f_1(x-t)\,f_2(t)\,dt$. This is called
the convolution of the given distributions.
16. It is clear that $f(t) \le cg(t)$ for all t as required. Since $\int_0^\infty g(t)\,dt = 1$ we have
$g(t) = Ct^{a-1}$ for $0 \le t < 1$, $Ce^{-t}$ for $t \ge 1$, where $C = ae/(a+e)$. A random variable
with density g is easy to obtain as a mixture of two distributions, $G_1(x) = x^a$ for
$0 \le x < 1$, and $G_2(x) = 1 - e^{1-x}$ for $x \ge 1$:

G1. [Initialize.] Set $p \leftarrow e/(a+e)$. (This is the probability that $G_1$ should be
used.)

G2. [Generate G deviate.] Generate independent uniform deviates U and V, where
$V \ne 0$. If $U < p$, set $X \leftarrow V^{1/a}$ and $q \leftarrow e^{-X}$; otherwise set $X \leftarrow 1 - \ln V$
and $q \leftarrow X^{a-1}$. (Now X has density g, and $q = f(X)/cg(X)$.)

G3. [Reject?] Generate a new uniform deviate U. If $U \ge q$, return to G2. |

The average number of iterations is $c = (a+e)/(e\,\Gamma(a+1)) < 1.4$.

It is possible to streamline this procedure in several ways. First, we can replace
V by an exponential deviate Y of mean 1, generated by Algorithm S, say, and then we
set $X \leftarrow e^{-Y/a}$ or $X \leftarrow 1 + Y$ in the two cases. Moreover, if we set $q \leftarrow pe^{-X}$ in the
first case and $q \leftarrow p + (1-p)X^{a-1}$ in the second, we can use the original U instead of
a newly generated one in step G3. Finally if $U < p/e$ we can accept $V^{1/a}$ immediately,
avoiding the calculation of q about 30 percent of the time.
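
Steps G1 through G3 translate almost line for line into code. The sketch below follows the unstreamlined version, assuming $0 < a < 1$; the function name `gamma_small_order` is hypothetical and Python's `random` module stands in for the uniform generator.

```python
import math
import random


def gamma_small_order(a, rng=random):
    """Gamma deviate of order a, 0 < a < 1, by the mixture-plus-rejection steps G1-G3."""
    p = math.e / (a + math.e)            # G1: probability of using G_1
    while True:
        u = rng.random()                 # G2: two independent uniforms
        v = 1.0 - rng.random()           # V in (0, 1], hence V != 0
        if u < p:
            x = v ** (1.0 / a)
            q = math.exp(-x)
        else:
            x = 1.0 - math.log(v)
            q = x ** (a - 1.0)
        if rng.random() < q:             # G3: reject when the new U >= q
            return x
```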
17. (a) $F(x) = 1 - (1-p)^{\lfloor x\rfloor}$, for $x \ge 0$. (b) $G(z) = pz/(1 - (1-p)z)$. (c) Mean
$1/p$, standard deviation $\sqrt{1-p}/p$. To do the latter calculation, observe that if $H(z) =
q + (1-q)z$, then $H'(1) = 1-q$ and $H''(1) + H'(1) - (H'(1))^2 = q(1-q)$, so the mean
and variance of $1/H(z)$ are $q - 1$ and $q(q-1)$, respectively. (See Section 1.2.10.) In
this case, $q = 1/p$; the extra factor z in the numerator of G(z) adds 1 to the mean.


18. Set $N \leftarrow N_1 + N_2 - 1$, where $N_1$ and $N_2$ independently have the geometric
distribution for probability p. (Consider the generating function.)


19. Set $N \leftarrow N_1 + \cdots + N_t - t$, where the $N_j$ have the geometric distribution for p.
(This is the number of failures before the tth success, when a sequence of independent
trials are made each of which succeeds with probability p.)

For $t = p = \frac12$, and in general when the mean value (namely $t(1-p)/p$) of the
distribution is small, we can simply evaluate the probabilities $p_n = \binom{t-1+n}{n}p^t(1-p)^n$
consecutively for n = 0, 1, 2, ... as in the following algorithm:


N1. [Initialize.] Set $N \leftarrow 0$, $q \leftarrow p^t$, $r \leftarrow q$, and generate a random uniform
deviate U. (We will have $q = p_N$ and $r = p_0 + \cdots + p_N$ during this algorithm,
which stops as soon as $U < r$.)

N2. [Iterate.] If $U \ge r$, set $N \leftarrow N+1$, $q \leftarrow q(1-p)(t-1+N)/N$, $r \leftarrow r+q$,
and repeat this step. Otherwise return N and terminate. |
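
Steps N1 and N2 can be sketched in Python as below. The function name `negative_binomial_by_search` is hypothetical; the uniform deviate is passed in as an argument so that the behavior is reproducible, and the extra `q > 0.0` guard is an added safeguard (not part of the stated algorithm) against floating-point roundoff when U is extremely close to 1.

```python
def negative_binomial_by_search(t, p, u):
    """Return N with the negative binomial distribution (t, p), given uniform u in [0, 1)."""
    n = 0
    q = p ** t          # q = p_N, the probability of the current N
    r = q               # r = p_0 + ... + p_N
    while u >= r and q > 0.0:
        n += 1
        q = q * (1.0 - p) * (t - 1 + n) / n
        r += q
    return n
```

For t = 1 and p = 1/2 the probabilities are 1/2, 1/4, 1/8, ..., so a deviate u = 0.6 lands in the second cell and yields N = 1.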


[An interesting technique for the negative binomial distribution, for arbitrarily
large real values of t, has been suggested by R. Léger: First generate a random gamma
deviate X of order t, then let N be a random Poisson deviate of mean $X(1-p)/p$.]


20. R1 = 1 + (1 − A/R) · R1. When R2 is performed, the algorithm terminates with
probability I/R; when R3 is performed, it goes to R1 with probability E/R. We have

    R1    R/A    R/A          R/A          R/A
    R2    0      R/A          0            R/A
    R3    0      0            R/A          R/A − I/A
    R4    R/A    R/A − I/A    R/A − E/A    R/A − I/A − E/A
21. $R = \sqrt{8/e} \approx 1.71553$; $A = \sqrt{2}\,\Gamma(3/2) = \sqrt{\pi/2} \approx 1.25331$. Since

$$\int u\sqrt{a-bu}\,du = (a-bu)^{3/2}\bigl(\tfrac25(a-bu) - \tfrac23 a\bigr)/b^2,$$

we have $I = 2\int_0^{a/b} u\sqrt{a-bu}\,du = \frac{8}{15}a^{5/2}/b^2$ where $a = 4(1+\ln c)$ and $b = 4c$; when
$c = e^{1/4}$, I has its maximum value $\frac56\sqrt{5/e} \approx 1.13020$. Finally the following integration
formulas are needed for E:

$$\int\sqrt{bu-au^2}\,du = \tfrac18 b^2a^{-3/2}\arcsin(2ua/b-1) + \tfrac14 ba^{-1}\sqrt{bu-au^2}\,(2ua/b-1),$$

$$\int\sqrt{bu+au^2}\,du = -\tfrac18 b^2a^{-3/2}\ln\bigl(\sqrt{bu+au^2}+u\sqrt{a}+b/2\sqrt{a}\bigr) + \tfrac14 ba^{-1}\sqrt{bu+au^2}\,(2ua/b+1),$$


where a, b > 0. Let the test in step R3 be “$X^2 \ge 4e^{x-1}/U - 4x$”; then the exterior region
hits the top of the rectangle when $u = r(x) = (e^x - \sqrt{e^{2x} - 2ex})/2ex$. (Incidentally,
r(x) reaches its maximum value at x = 1/2, a point where it is not differentiable!) We
have $E = 2\int_0^{r(x)}\bigl(\sqrt{2/e} - \sqrt{bu-au^2}\bigr)\,du$ where $b = 4e^{x-1}$ and $a = 4x$. The maximum
value of E occurs near x = −.35, where we have E ≈ .29410.


22. (Solution by G. Marsaglia.) Consider the “continuous Poisson distribution” defined
by $G(x) = \int_\mu^\infty e^{-t}t^{x-1}\,dt/\Gamma(x)$, for $x > 0$; if X has this distribution then
$\lfloor X\rfloor$ is Poisson distributed, since $G(x+1) - G(x) = e^{-\mu}\mu^x/x!$. If $\mu$ is large, G is
approximately normal, hence $G^{[-1]}(F_\mu(x))$ is approximately linear, where $F_\mu(x)$ is
the distribution function for a normal deviate with mean and variance $\mu$; that is,
$F_\mu(x) = F((x-\mu)/\sqrt{\mu})$, where F(x) is the normal distribution function (10). Let
g(x) be an efficiently computable function such that $|G^{[-1]}(F_\mu(x)) - g(x)| < \epsilon$ for
$-\infty < x < \infty$; we can now generate Poisson deviates efficiently as follows: Generate
a normal deviate X, and set $Y \leftarrow g(\mu + \sqrt{\mu}\,X)$, $N \leftarrow \lfloor Y\rfloor$, $M \leftarrow \lfloor Y + \frac12\rfloor$. Then if
$|Y - M| > \epsilon$, output N; otherwise output $M - [G^{[-1]}(F(X)) < M]$.
This approach applies also to the binomial distribution, with

$$G(x) = \int_p^1 u^{x-1}(1-u)^{t-x}\,du\;\frac{\Gamma(t+1)}{\Gamma(x)\,\Gamma(t+1-x)},$$

since $\lfloor G^{[-1]}(U)\rfloor$ is binomial with parameters (t, p) and G is approximately normal.
[See also the alternative method proposed by Ahrens and Dieter in Computing 25
(1980), 193-208.]


23. Yes. The second method calculates $|\cos 2\theta|$, where $\theta$ is uniformly distributed
between 0 and $\pi/2$. (Let $U = r\cos\theta$, $V = r\sin\theta$.)


25. a = (.10101)2. In general, the binary representation is formed by using 1 for | 
and 0 for &, from left to right, then suffixing 1. This technique [see K. D. Tocher, 
J. Roy. Stat. Soc. B16 (1954), 49] can lead to efficient generation of independent bits 
having a given probability p, and it can also be applied to the geometric and binomial 
distributions. 

26. (a) True: $\sum_k \Pr(N_1 = k)\Pr(N_2 = n-k) = e^{-\mu_1-\mu_2}(\mu_1+\mu_2)^n/n!$. (b) False,
unless $\mu_2 = 0$; otherwise $N_1 - N_2$ might be negative.

27. Let the binary representation of p be $(.b_1b_2b_3\ldots)_2$, and proceed according to the
following rules:


B1. [Initialize.] Set $m \leftarrow t$, $N \leftarrow 0$, $j \leftarrow 1$. (During this algorithm, m represents
the number of simulated uniform deviates whose relation to p is still unknown,
since they match p in their leading j−1 bits; and N is the number of simulated
deviates known to be less than p.)

B2. [Look at next column of bits.] Generate a random integer M with the binomial
distribution $(m, \frac12)$. (Now M represents the number of unknown deviates that
fail to match $b_j$.) Set $m \leftarrow m - M$, and if $b_j = 1$ set $N \leftarrow N + M$.

B3. [Done?] If m = 0, or if the remaining bits $(.b_{j+1}b_{j+2}\ldots)_2$ of p are all zero,
the algorithm terminates. Otherwise, set $j \leftarrow j+1$ and return to step B2. |
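
A small Python sketch of steps B1 through B3 follows, under several assumptions: p is supplied as a finite list of bits `p_bits` (so p is a dyadic fraction), the binomial $(m, \frac12)$ deviate is simulated naively by summing m random bits, and the termination test is performed before each column rather than after it, which gives the same output. The function name is hypothetical.

```python
import random


def binomial_by_bit_columns(t, p_bits, rng=random):
    """Binomial (t, p) deviate, with p = (.b1 b2 ...)_2 given as a list of 0/1 bits."""
    m, count = t, 0                      # B1: m undecided deviates, count known < p
    for j, b in enumerate(p_bits):
        if m == 0 or not any(p_bits[j:]):
            break                        # B3: nothing left to decide
        # B2: how many undecided deviates fail to match bit b_j
        mismatches = sum(rng.getrandbits(1) for _ in range(m))
        m -= mismatches
        if b == 1:
            count += mismatches          # their bit is 0 where p has 1, so they are < p
    return count
```

A production version would replace the naive bit-summing by a genuinely efficient binomial $(m, \frac12)$ generator, which is the point of the exercise.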


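[When $b_j = 1$ for infinitely many j, the average number of iterations $A_t$ satisfies

$$A_0 = 0;\qquad A_n = 1 + \frac{1}{2^n}\sum_k \binom{n}{k}A_k\quad\text{for } n \ge 1.$$

Letting $A(z) = \sum A_nz^n/n!$, we have $A(z) = e^z - 1 + A(\frac12 z)\,e^{z/2}$. Therefore $A(z)e^{-z} =
1 - e^{-z} + A(\frac12 z)e^{-z/2} = \sum_{k\ge0}(1 - e^{-z/2^k}) = 1 - e^{-z} - \sum_{n\ge1}(-z)^n/(n!\,(2^n-1))$, and

$$A_n = 1 + \sum_{k\ge1}\binom{n}{k}\frac{(-1)^{k+1}}{2^k-1} = 1 + \frac{V_{n+1}}{n+1} = \lg n + \frac{\gamma}{\ln 2} + \frac12 + f_0(n) + O(n^{-1})$$

in the notation of exercise 5.2.2-48.]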


28. Generate a random point $(y_1,\ldots,y_n)$ on the unit sphere, and let $\rho = \sqrt{\sum a_ky_k^2}$.
Generate an independent uniform deviate U, and if $\rho^{n+1}U < K\sqrt{\sum a_k^2y_k^2}$, output the
point $(y_1/\rho,\ldots,y_n/\rho)$; otherwise start over. Here $K^2 = \min\{(\sum a_ky_k^2)^{n+1}/(\sum a_k^2y_k^2) \mid
\sum y_k^2 = 1\} = a_n^{n-1}$ if $na_n \ge a_1$, $((n+1)/(a_1+a_n))^{n+1}(a_1a_n/n)^n$ otherwise.


29. Let $X_{n+1} = 1$, then set $X_k \leftarrow X_{k+1}U_k^{1/k}$ or $X_k \leftarrow X_{k+1}e^{-Y_k/k}$ for k = n, n−1,
..., 1, where $U_k$ is uniform or $Y_k$ is exponential. [ACM Trans. Math. Software 6 (1980),
359-364. This technique was introduced in the 1960s by David Seneschal; see Amer.
Statistician 26, 4 (October 1972), 56-57. The alternative of generating n uniform
numbers and sorting them is probably faster, with an appropriate sorting method, but
the method suggested here is particularly valuable if only a few of the largest or smallest
X's are desired. Notice that $(F^{[-1]}(X_1),\ldots,F^{[-1]}(X_n))$ will be sorted deviates having
distribution F.]
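
The recurrence of answer 29 can be sketched as follows, using the uniform variant $X_k \leftarrow X_{k+1}U_k^{1/k}$. The function name `sorted_uniforms` is hypothetical, and Python's `random` module stands in for the uniform generator.

```python
import random


def sorted_uniforms(n, rng=random):
    """Return X_1 <= X_2 <= ... <= X_n, distributed like n sorted uniform deviates."""
    xs = [0.0] * (n + 2)
    xs[n + 1] = 1.0                          # X_{n+1} = 1
    for k in range(n, 0, -1):                # k = n, n-1, ..., 1
        u = 1.0 - rng.random()               # uniform in (0, 1]
        xs[k] = xs[k + 1] * u ** (1.0 / k)   # largest remaining value first
    return xs[1:n + 1]
```

Because the values are produced from the largest downward, the loop can simply be stopped early when only the few largest deviates are wanted.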


30. Generate random numbers $Z_1 = -\mu^{-1}\ln U_1$, $Z_2 = Z_1 - \mu^{-1}\ln U_2$, ..., until
$Z_{m+1} \ge 1$. Output $(X_j, Y_j) = f(Z_j)$ for $1 \le j \le m$, where $f((.b_1b_2\ldots b_{2r})_2) =
((.b_1b_2\ldots b_r)_2, (.b_{r+1}b_{r+2}\ldots b_{2r})_2)$. If the less significant bits are significantly less
random than the more significant bits, it's safer (but slower) to let $f((.b_1b_2\ldots b_{2r})_2) =
((.b_1b_3\ldots b_{2r-1})_2, (.b_2b_4\ldots b_{2r})_2)$.


31. (a) It suffices to consider the case k = 2, since $a_1X_1 + \cdots + a_kX_k = X\cos\theta + Y\sin\theta$
when $X = X_1$, $\cos\theta = a_1$, and $Y = (a_2X_2 + \cdots + a_kX_k)/\sin\theta$. And

$$\Pr(X\cos\theta + Y\sin\theta \le x) = \frac{1}{2\pi}\iint e^{-s^2/2-t^2/2}\,ds\,dt\,[s\cos\theta + t\sin\theta \le x]
= \frac{1}{2\pi}\iint e^{-u^2/2-v^2/2}\,du\,dv\,[u \le x] = (10),$$

from the substitution $u = s\cos\theta + t\sin\theta$, $v = -s\sin\theta + t\cos\theta$.


(b) There are numbers $\alpha > 1$ and $\beta > 1$ such that $(\alpha^{-24} + \alpha^{-55})/\sqrt2 = 1$ and
$\frac35\beta^{-24} + \frac45\beta^{-55} = 1$; so the numbers $X_n$ will grow exponentially with n, by the properties
of linear recurrences.

If we break out of the linear recurrence mold by, say, using the recurrence $X_n =
X_{n-24}\cos\theta_n + X_{n-55}\sin\theta_n$, where $\theta_n$ is chosen uniformly in $[0\,..\,2\pi)$, we probably will
obtain decent results; but this alternative would involve much more computation.

(c) Start with, say, 2048 normal deviates $X_0, \ldots, X_{1023}, Y_0, \ldots, Y_{1023}$. After
having used about 1/3 of them, generate 2048 more as follows: Choose integers a, b, c,
and d uniformly in [0 .. 1024), with a and c odd; then set

$$X'_j \leftarrow X_{(aj+b)\bmod 1024}\cos\theta + Y_{(cj+d)\bmod 1024}\sin\theta,$$

$$Y'_j \leftarrow -X_{(aj+b)\bmod 1024}\sin\theta + Y_{(cj+d)\bmod 1024}\cos\theta,$$

for $0 \le j < 1024$, where $\cos\theta$ and $\sin\theta$ are random ratios $(U^2 - V^2)/(U^2 + V^2)$ and
$2UV/(U^2 + V^2)$, chosen as in exercise 23. We can reject U and V unless $|\cos\theta| \ge \frac12$
and $|\sin\theta| \ge \frac12$. The 2048 new deviates now replace the old ones. Notice that only a
few operations were needed per new deviate.

This method does not diverge like the sequences considered in (b), because the sum
of squares $\sum(X_j^2 + Y_j^2) = \sum((X'_j)^2 + (Y'_j)^2)$ remains at the constant value $S \approx 2048$,
except for a slight roundoff error. On the other hand, the constancy of S is actually a
defect of the method, because the sum of squares should really have the $\chi^2$ distribution
with 2048 degrees of freedom. To overcome this problem, the normal deviates actually
delivered to the user should be not $X_j$ but $\alpha X_j$, where $\alpha^2 = \frac12(Y_{1023} + \sqrt{4095})^2/S$
is a precomputed scale factor. (The quantity $\frac12(Y_{1023} + \sqrt{4095})^2$ will be a reasonable
approximation to the $\chi^2$ deviate desired.)

References: C. S. Wallace [ACM Trans. on Math. Software 22 (1996), 119-127]; 
R. P. Brent [Lecture Notes in Comp. Sci. 1470 (1998), 1-20]. 
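
The refresh step of part (c) can be sketched in Python as below. This is an illustration only: the function name `wallace_refresh` is hypothetical, the pool size is taken from the input lists (assumed to be a power of two so that an odd multiplier permutes the indices), and U and V are drawn from the unit disk with both signs, which is an assumption about how the random ratios are chosen. The final rescaling by $\alpha$ is not shown.

```python
import random


def wallace_refresh(xs, ys, rng=random):
    """Produce new pools (X', Y') from old pools by a random rotation of permuted pairs."""
    size = len(xs)                       # assumed to be a power of two, e.g. 1024
    a = rng.randrange(size) | 1          # odd multipliers give index permutations
    c = rng.randrange(size) | 1
    b = rng.randrange(size)
    d = rng.randrange(size)
    while True:
        u = 2.0 * rng.random() - 1.0
        v = 2.0 * rng.random() - 1.0
        s = u * u + v * v
        if s == 0.0 or s >= 1.0:
            continue
        cos_t = (u * u - v * v) / s      # random ratios in place of cos and sin calls
        sin_t = 2.0 * u * v / s
        if abs(cos_t) >= 0.5 and abs(sin_t) >= 0.5:
            break                        # reject rotations that mix too little
    new_x, new_y = [], []
    for j in range(size):
        x = xs[(a * j + b) % size]
        y = ys[(c * j + d) % size]
        new_x.append(x * cos_t + y * sin_t)
        new_y.append(-x * sin_t + y * cos_t)
    return new_x, new_y
```

Each output pair is a rotation of one input pair, which is why the total sum of squares is unchanged apart from roundoff.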


32. (a) This mapping $(X', Y') = f(X, Y)$ is a one-to-one correspondence from the set
$\{x, y \ge 0\}$ to itself such that $x' + y' = x + y$ and $dx'\,dy' = dx\,dy$. We have
X ( X Y | ( Y 
X'+Y' X+Y X'+Y' X+Y 
(b) This mapping is a two-to-one correspondence such that $x' + y' = x + y$ and
$dx'\,dy' = 2\,dx\,dy$.
(c) It suffices to consider the “j-flip” transformation 


a) mod 1, } a) mod 1. 


$$X' = (\ldots x_{j+2}x_{j+1}x_j\,y_{j-1}y_{j-2}y_{j-3}\ldots)_2,$$

$$Y' = (\ldots y_{j+2}y_{j+1}y_j\,x_{j-1}x_{j-2}x_{j-3}\ldots)_2,$$

for a fixed integer j, and then to compose j-flips for j = 0, 1, −1, 2, −2, ..., noticing
that the joint probability distribution of X′ and Y′ converges as $|j| \to \infty$. Each j-flip
is one-to-one, with $x' + y' = x + y$ and $dx'\,dy' = dx\,dy$.

33. Use $U_1$ as the seed for another random number generator (perhaps a linear
congruential generator with a different multiplier); take $U_2, U_3, \ldots$ from that one.


SECTION 3.4.2 


1. There are $\binom{N-t}{n-m}$ ways to pick n − m records from the last N − t, and $\binom{N-t-1}{n-m-1}$
ways to pick n − m − 1 from N − t − 1 after selecting the (t + 1)st item.


2. Step S3 will never go to step S5 when the number of records left to be examined 
is equal to n — m. 


3. We should not confuse conditional and unconditional probabilities. The quan- 
tity m depends randomly on the selections that took place among the first t elements; 
if we take the average over all possible choices that could have occurred among these 
elements, we will find that (n — m)/(N — t) is exactly n/N on the average. For 
example, consider the second element; if the first element was selected in the sample 
(this happens with probability n/N), the second element is selected with probability 
$(n-1)/(N-1)$; if the first element was not selected, the second is selected with
probability $n/(N-1)$. The overall probability of selecting the second element is
$(n/N)((n-1)/(N-1)) + (1 - n/N)(n/(N-1)) = n/N$.


4. From the algorithm,

$$p(m, t+1) = \Bigl(1 - \frac{n-m}{N-t}\Bigr)\,p(m, t) + \frac{n-(m-1)}{N-t}\,p(m-1, t).$$

The desired formula can be proved by induction on t. In particular, p(n, N) = 1.


5. In the notation of exercise 4, the probability that t = k at termination is $q_k =
p(n, k) - p(n, k-1) = \binom{k-1}{n-1}/\binom{N}{n}$. The average is $\sum_k kq_k = (N+1)n/(n+1)$.

6. Similarly, $\sum_k k(k+1)q_k = (N+2)(N+1)n/(n+2)$; the variance is therefore
$(N+1)(N-n)n/(n+2)(n+1)^2$.

7. Suppose the choice is $1 \le x_1 < x_2 < \cdots < x_n \le N$. Let $x_0 = 0$, $x_{n+1} = N+1$.
The choice is obtained with probability $p = \prod_{1\le t\le N} p_t$, where

$$p_t = \begin{cases}(N-(t-1)-n+m)/(N-(t-1)), & \text{for } x_m < t < x_{m+1};\\
(n-m)/(N-(t-1)), & \text{for } t = x_{m+1}.\end{cases}$$

The denominator of the product p is N!; the numerator contains the terms N − n,
N − n − 1, ..., 1 for those t's that are not x's, and the terms n, n − 1, ..., 1 for those
t's that are x's. Hence $p = (N-n)!\,n!/N!$.

Example: n = 3, N = 8, $(x_1, x_2, x_3) = (2, 3, 7)$; $p = \frac58\cdot\frac37\cdot\frac26\cdot\frac45\cdot\frac34\cdot\frac23\cdot\frac12\cdot\frac11$.


8. (a) $p(0, k) = \binom{N-k}{n}/\binom{N}{n} = \binom{N-n}{k}/\binom{N}{k}$; $\binom{N-k}{n}$ of the $\binom{N}{n}$ samples omit the first k records.

(b) Set $X \leftarrow k - 1$, where k is minimum with $U \ge \Pr(X \ge k)$. Thus, start with
$X \leftarrow 0$, $p \leftarrow N - n$, $q \leftarrow N$, $R \leftarrow p/q$, and while $U < R$ set $X \leftarrow X + 1$, $p \leftarrow p - 1$,
$q \leftarrow q - 1$, $R \leftarrow Rp/q$. (This method is good when n/N is, say, $\ge 1/5$. We can assume
that $n/N \le 1/2$; otherwise it's better to select N − n unsampled items.)

(c) $\Pr(\min(Y_N, \ldots, Y_{N-n+1}) \ge k) = \prod_{j=0}^{n-1}\Pr(Y_{N-j} \ge k) = \prod_{j=0}^{n-1}((N-j-k)/
(N-j))$. (This method is good if, say, $n \le 5$.)

(d) (See exercise 3.4.1-29.) The value $X \leftarrow \lfloor N(1 - U^{1/n})\rfloor$ needs to be rejected
with probability only O(n/N). Precise details are worked out carefully in CACM
27 (1984), 703-718, and a practical implementation appears in ACM Trans. Math.
Software 13 (1987), 58-67. (This method is good when, say, $5 < n < \frac15N$.)

After skipping X records and selecting the next, we set $n \leftarrow n - 1$, $N \leftarrow N - X - 1$,
and repeat the process until n = 0. A similar approach speeds up the reservoir method;
see ACM Trans. Math. Software 11 (1985), 37-57.
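
The sequential search of part (b), together with the closing remark about repeating the process, can be sketched as follows. The function names `skip_count` and `sample_by_skipping` are hypothetical, records are numbered from 1, and Python's `random` module stands in for the uniform generator.

```python
import random


def skip_count(n, total, u):
    """Number X of records to skip before the next selected one, given uniform u."""
    x = 0
    p = total - n
    q = total
    r = p / q                      # r = Pr(X >= x + 1)
    while u < r:
        x += 1
        p -= 1
        q -= 1
        r = r * p / q
    return x


def sample_by_skipping(n, total, rng=random):
    """Choose n of the records 1..total, returned in increasing order."""
    chosen = []
    pos = 0                        # index of the last record consumed so far
    while n > 0:
        x = skip_count(n, total, rng.random())
        pos += x + 1               # skip x records, select the next
        chosen.append(pos)
        n -= 1
        total -= x + 1
    return chosen
```

The loop in `skip_count` always stops, because p reaches 0 after at most N − n skips and the running product then becomes 0.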

9. The reservoir gets seven records: 1, 2, 3, 5, 9, 13, 16. The final sample consists of 
records 2, 5, 16. 


10. Delete step R6 and the variable m. Replace the J table by a table of records, 
initialized to the first n records in step R1, and with the new record replacing the Mth 
table entry in step R4. 


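11. Arguing as in Section 1.2.10, which considers the special case n = 1, we see that
the generating function is

$$G(z) = z^n\Bigl(\frac{1}{n+1} + \frac{n}{n+1}z\Bigr)\Bigl(\frac{2}{n+2} + \frac{n}{n+2}z\Bigr)\ldots\Bigl(\frac{N-n}{N} + \frac{n}{N}z\Bigr).$$

The mean is $n + \sum_{n<t\le N}(n/t) = n(1 + H_N - H_n)$; and the variance turns out to be
$n(H_N - H_n) - n^2(H_N^{(2)} - H_n^{(2)})$.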
12. (Note that $\pi^{-} = (b_t\,t)\ldots(b_3\,3)(b_2\,2)$, so we seek an algorithm that goes from the
representation of $\pi$ to that for $\pi^{-}$.) Set $b_j \leftarrow j$ for $1 \le j \le t$. Then for j = 2, 3, ..., t
(in this order), interchange $b_j \leftrightarrow b_{a_j}$. Finally for j = t, ..., 3, 2 (in this order), set
$b_{a_j} \leftarrow b_j$. (The algorithm is based on the fact that $(a_t\,t)\pi_1 = \pi_1(b_t\,t)$.)


13. Renumbering the deck 0, 1, ..., 2n − 2, we find that s takes card number x into
card number (2x) mod (2n−1), while c takes card x into (x − 1) mod (2n−1). We have
(c followed by s) = $cs = sc^2$. Therefore any product of c's and s's can be transformed
into the form $s^ic^k$. Also $2^{\varphi(2n-1)} \equiv 1$ modulo (2n−1); since $s^{\varphi(2n-1)}$ and $c^{2n-1}$ are the
identity permutation, at most $(2n-1)\varphi(2n-1)$ arrangements are possible. (The exact
number of different arrangements is (2n−1)k, where k is the order of 2 modulo (2n−1).
For if $s^k = c^j$, then $c^j$ fixes the card 0, so $s^k = c^j$ = identity.) For further details, see
SIAM Review 3 (1961), 293-297.


14. (a) Q. We could have deduced this regardless of where he had moved it, unless
he had put it into one of the first three or last two positions. (b) 2. Three cut-and-riffles
will produce an intermixture of at most eight cyclically increasing subsequences
$a_{x_j}\,a_{(x_j+1)\bmod n}\ldots a_{(x_{j+1}-1)\bmod n}$; hence the subsequence 6 5 4 is a dead giveaway.
[Several magic tricks are based on the fact that three cut-and-riffles are highly
nonrandom; see Martin Gardner, Mathematical Magic Show (Knopf, 1977), Chapter 7.]


15. Set $Y_j \leftarrow j$ for $t - n < j \le t$. Then for j = t, t − 1, ..., t − n + 1 do the following
operations: Set $k \leftarrow \lfloor jU\rfloor + 1$. If $k > t - n$ then set $X_j \leftarrow Y_k$ and $Y_k \leftarrow Y_j$; otherwise if
$k = X_i$ for some $i > j$ (a symbol table algorithm could be used), then set $X_j \leftarrow Y_i$ and
$Y_i \leftarrow Y_j$; otherwise set $X_j \leftarrow k$. (The idea is to let $Y_{t-n+1}, \ldots, Y_j$ represent $X_{t-n+1},
\ldots, X_j$, and if $i > j$ and $X_i \le t - n$ also to let $Y_i$ represent $X_{X_i}$, in the execution of
Algorithm P. It is interesting to prove the correctness of Dahl's algorithm. One basic
observation is that, in step P2, $X_k \ne k$ implies $X_k > j$, for $1 \le k \le j$.)


16. We may assume that $n \le \frac12N$, otherwise it suffices to find the N − n elements not
in the sample. Using a hash table of size 2n, the idea is to generate random numbers
between 1 and N, storing them in the table and discarding duplicates, until n distinct
numbers have been generated. The average number of random numbers generated is
$N/N + N/(N-1) + \cdots + N/(N-n+1) < 2n$, by exercise 3.3.2-10, and the average
time to process each number is O(1). We want to output the results in increasing
order, and this can be done as follows: Using an ordered hash table (exercise 6.4-66)
with linear probing, the hash table will appear as if the values had been inserted in
increasing order and the average total number of probes will be less than $\frac52n$. Thus
if we use a monotonic hash address such as $\lfloor 2n(k-1)/N\rfloor$ for the key k, it will be a
simple matter to output the keys in sorted order by making at most two passes over
the table. [See CACM 29 (1986), 366-367.]


17. Show inductively that before step j, the set S is a random sample of j − N − 1 + n
integers from {1, ..., j − 1}. [CACM 30 (1987), 754-757. Floyd's method can be used
to speed up the solution to exercise 16. It is essentially dual to Dahl's algorithm in
exercise 15, which operates for decreasing values of j; see exercise 12.]
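
For orientation, Floyd's method as it is commonly stated can be sketched in a few lines of Python. The exercise statement itself is not reproduced in this answer, so the loop below is an assumption about its form; it is at least consistent with the invariant above, since S holds j − N − 1 + n elements before step j. The function name `floyd_sample` is hypothetical.

```python
import random


def floyd_sample(n, total, rng=random):
    """Return a random n-element subset of {1, ..., total} (sketch of Floyd's method)."""
    s = set()
    for j in range(total - n + 1, total + 1):
        k = rng.randint(1, j)      # uniform integer in 1..j
        if k in s:
            s.add(j)               # k already present: take j itself, which is new
        else:
            s.add(k)
    return s
```

Exactly n random numbers are generated, with no rejection loop.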


18. (a) Oriented trees that essentially merge (1,2,...) with (n,n —1,...), such as 


[Figure: an oriented tree with a top row of nodes 1, 2, ..., 14 and a bottom row of nodes 26, 25, ..., 16; the arcs are not recoverable here.]
(b) Collections of 1-cycles and 2-cycles. (c) Binary search trees on the keys (1, 2, ..., n),
with $k_j$ the parent of j (or j, at the root); see Section 6.2.2. The number of $(k_1, \ldots, k_n)$
in each case is (a) $2^{n-1}$; (b) $t_n > \sqrt{n!}$, see 5.1.4-(40); (c) $\binom{2n}{n}\frac{1}{n+1}$. [Case (a) represents
the least common permutation; case (b) represents the most common, when $n \ge 18$.
See D. P. Robbins and E. D. Bolker, Æquationes Mathematicæ 22 (1981), 268-292;
D. Goldstein and D. Moews, Æquationes Mathematicæ 65 (2003), 3-30.]
19. See N. Duffield, C. Lund, and M. Thorup, JACM 54 (2007), 32:1-32:37. 


SECTION 3.5 


1. A b-ary sequence, yes (see exercise 2); a [0..1) sequence, no (since only finitely 
many values are assumed by the elements). 


2. It is 1-distributed and 2-distributed, but not 3-distributed (the binary number 111 
never appears). 


3. Repeat the sequence in exercise 3.2.2-17, with a period of length 27. 


4. If $\nu_1(n)$, $\nu_2(n)$, $\nu_3(n)$, $\nu_4(n)$ are the counts for the four probabilities, we have
$\nu_1(n) + \nu_2(n) = \nu_3(n) + \nu_4(n)$ for all n. So the desired result follows by addition of limits.


5. The sequence begins $\frac13, \frac23, \frac23, \frac13, \frac13, \frac13, \frac13, \frac23, \frac23, \frac23, \frac23, \frac23, \frac23, \frac23, \frac23$, etc. When n = 1, 3,
7, 15, ... we have $\nu(n)$ = 1, 1, 5, 5, ... so that $\nu(2^{2k-1} - 1) = \nu(2^{2k} - 1) = (2^{2k} - 1)/3$;
hence $\nu(n)/n$ oscillates between $\frac13$ and approximately $\frac23$, and no limit exists. The
probability is undefined. [The methods of Section 4.2.4 show, however, that a numerical
value can meaningfully be assigned to $\Pr(U_n < \frac12)$ = Pr(leading digit of the radix-4
representation of n + 1 is 1), namely $\log_4 2 = \frac12$.]
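
The oscillation can be checked numerically. The sketch below assumes, following the bracketed remark, that the sequence is indexed from $U_0$ and that $U_n = \frac13$ exactly when the leading radix-4 digit of n + 1 is 1; the function names `u` and `nu` are hypothetical.

```python
from fractions import Fraction


def u(n):
    """U_n for n >= 0: 1/3 if the leading radix-4 digit of n + 1 is 1, else 2/3."""
    m = n + 1
    while m >= 4:
        m //= 4                      # strip low-order radix-4 digits
    return Fraction(1, 3) if m == 1 else Fraction(2, 3)


def nu(n):
    """Number of j < n with U_j < 1/2."""
    return sum(1 for j in range(n) if u(j) < Fraction(1, 2))


for n in (1, 3, 7, 15, 31, 63):
    print(n, nu(n), float(Fraction(nu(n), n)))
```

The printed ratios alternate between exactly 1/3 (at n = 3, 15, 63) and values approaching 2/3 from above (at n = 7, 31), so no limit exists.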

6. By exercise 4 and induction, $\Pr(S_j(n)\text{ for some } j, 1 \le j \le k) = \sum_{j=1}^k \Pr(S_j(n))$.
As $k \to \infty$, the latter is a monotone sequence bounded by 1, so it converges; and
$\underline{\Pr}(S_j(n)\text{ for some } j \ge 1) \ge \sum_{j=1}^k \Pr(S_j(n))$ for all k. For a counterexample to equality,
it is not hard to arrange things so that $S_j(n)$ is always true for some j, yet
$\Pr(S_j(n)) = 0$ for all j.

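7. Let $p_i = \sum_{j\ge1}\Pr(S_{ij}(n))$. The result of the preceding exercise can be generalized
to $\underline{\Pr}(S_j(n)\text{ for some } j \ge 1) \ge \sum_{j\ge1}\Pr(S_j(n))$, for any disjoint statements $S_j(n)$.
So we have $1 = \Pr(S_{ij}(n)\text{ for some } i, j \ge 1) \ge \sum_{i\ge1}\underline{\Pr}(S_{ij}(n)\text{ for some } j \ge 1) \ge
\sum_{i\ge1} p_i = 1$, and hence $\underline{\Pr}(S_{ij}(n)\text{ for some } j \ge 1) = p_i$. Given $\epsilon > 0$, let I be large
enough so that $\sum_{i=1}^I p_i \ge 1 - \epsilon$. Let

$$\phi_i(N) = (\text{number of } n < N \text{ with } S_{ij}(n) \text{ true for some } j \ge 1)/N.$$

Clearly $\sum_{i=1}^I \phi_i(N) \le 1$, and for all large enough N we have $\sum_{i=2}^I \phi_i(N) \ge \sum_{i=2}^I p_i - \epsilon$;
hence $\phi_1(N) \le 1 - \phi_2(N) - \cdots - \phi_I(N) \le 1 - p_2 - \cdots - p_I + \epsilon \le 1 - (1 - \epsilon - p_1) + \epsilon = p_1 + 2\epsilon$.
This proves that $\overline{\Pr}(S_{1j}(n)\text{ for some } j \ge 1) \le p_1 + 2\epsilon$; hence $\Pr(S_{1j}(n)\text{ for some }
j \ge 1) = p_1$, and the desired result holds for i = 1. By symmetry of the hypotheses, it
holds for any value of i.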

8. Add together the probabilities for j, j +d, j + 2d, ..., m+ j — d in Definition E. 


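9. $\limsup_{n\to\infty}(a_n + b_n) \le \limsup_{n\to\infty} a_n + \limsup_{n\to\infty} b_n$; hence we find that

$$\limsup_{n\to\infty}\bigl((y_{1n} - \alpha)^2 + \cdots + (y_{mn} - \alpha)^2\bigr) \le m\alpha^2 - 2m\alpha^2 + m\alpha^2 = 0,$$

and this can happen only if each $(y_{jn} - \alpha)$ tends to zero.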


10. In the evaluation of the sum in Eq. (22). 
11. $\langle U_{2n}\rangle$ is k-distributed if $\langle U_n\rangle$ is (2, 2k − 1)-distributed.
12. Apply Theorem B with $f(x_1, \ldots, x_k) = [u \le \max(x_1, \ldots, x_k) < v]$.
13. Let
$$p_k = \Pr(U_n \text{ begins a gap of length } k - 1)
= \Pr(U_{n-1} \in [\alpha\,..\,\beta),\ U_n \notin [\alpha\,..\,\beta),\ \ldots,\ U_{n+k-2} \notin [\alpha\,..\,\beta),\ U_{n+k-1} \in [\alpha\,..\,\beta))
= p^2(1-p)^{k-1}.$$
It remains to translate this into the probability that $f(n) - f(n-1) = k$. Let $\nu_k(n)$ =
(number of $j \le n$ with $f(j) - f(j-1) = k$); let $\mu_k(n)$ = (number of $j \le n$ with $U_j$ the
beginning of a gap of length k − 1); and let $\mu(n)$ similarly count the number of $1 \le j \le n$
with $U_j \in [\alpha\,..\,\beta)$. We have $\mu_k(f(n)) = \nu_k(n)$, $\mu(f(n)) = n$. As $n \to \infty$, we must have
$f(n) \to \infty$, hence

$$\nu_k(n)/n = \bigl(\mu_k(f(n))/f(n)\bigr)\cdot\bigl(f(n)/\mu(f(n))\bigr) \to p_k/p = p(1-p)^{k-1}.$$
[We have only made use of the fact that the sequence is (k + 1)-distributed.]
14. Let $p_k = \Pr(U_n \text{ begins a run of length } k)$
$$= \Pr(U_{n-1} > U_n < \cdots < U_{n+k-1} > U_{n+k})
= \frac{1}{(k+2)!}\biggl(\binom{k+2}{1}\binom{k+1}{1} - \binom{k+2}{1} - \binom{k+2}{1} + 1\biggr) = \frac{k}{(k+1)!} - \frac{k+1}{(k+2)!}$$
(see exercise 3.3.2-13). Now proceed as in the previous exercise to transfer this to
$\Pr(f(n) - f(n-1) = k)$. [We have assumed only that the sequence is (k+2)-distributed.]
15. For $s, t \ge 0$ let

$$p_{st} = \Pr(X_{n-2t-3} = X_{n-2t-2} \ne X_{n-2t-1} \ne \cdots \ne X_{n-1} \text{ and } X_n = \cdots = X_{n+s} \ne X_{n+s+1}) = 2^{-s-2t-3};$$

for $t \ge 0$ let $q_t = \Pr(X_{n-2t-2} = X_{n-2t-1} \ne \cdots \ne X_{n-1}) = 2^{-2t-1}$. By exercise 7,

$\Pr(X_n$ is not the beginning of a coupon set$) = \sum_{t\ge0} q_t = \frac23$;
$\Pr(X_n$ is the beginning of coupon set of length $s + 2) = \sum_{t\ge0} p_{st} = \frac13\cdot 2^{-s-1}$.

Now proceed as in exercise 13.

16. (Solution by R. P. Stanley.) Whenever the subsequence S = (b — 1), (b — 2), ..., 

1, 0, 0, 1, ..., (b — 2), (b — 1) appears, a coupon set must end at the right of S, since 

some coupon set is completed in the first half of S. We now proceed to calculate the 

probability that a coupon set begins at position n by manipulating the probabilities 

that the last prior appearance of S ends at position n — 1, n — 2, etc., as in exercise 15. 

18. Proceed as in the proof of Theorem A to calculate $\underline{\Pr}$ and $\overline{\Pr}$.

19. (Solution by T. Herzog.) Yes. For example, apply exercise 33 to the sequence
$\langle U_{\lfloor n/2\rfloor}\rangle$, when $\langle U_n\rangle$ satisfies R4 (or even its weaker version).

20. (a) 2 and 4. (When n increases, we break 1 in half.) 

(b) Each new point breaks a single interval into two parts. Let ρ be equal to

max7_g ((n + k)i). Then 1 = J}: IP < S a <L * p/(n+k)=pn2+ 
O(1/n). So infinitely many m have mi}? > 1/In2+ O(1/m). 

c) To verify the hint, let ie (F) come from the interval with endpoints Um and Um’, 


and set ax = max(m—n,m’—n, 1). Then p = min?”_ n41 mẹ? ) implies 1 = 522, ih > 
De p/n + ak) > 25a 1/(n +k); hence 2p < 1/(Han — Hn )=1/n2 + Oia): 
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d) We have (1, ...,1) = (lg 241, 1g 2+2,... 
point always breaks the renee intetval. into intervals of length lg =} 
[Indagationes Math. 11 (1949), 14-17.] 

21. (a) No! We have Pr(W, < 4) > limsup,_,., v([2"-1/7])/f2"-1/7] = 2- v2, 
and Pr(Wn < 4) < liminfn+..(2")/2” = V2 —1, because v( [ar-1/21) = v(2") = 
1E o2" — 2) + O(n). 

b,c) See Indagationes Math. 40 (1978), 527-541. 


22. If the sequence is k-distributed, the limit is zero by integration and Theorem B. 
Conversely, note that if f(x1,..., £k) has an absolutely convergent Fourier series 


‚lg 52%), because the (n + 1)st 


antl 2n+2 
n+l’ 


and lg 5 


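$$f(x_1, \ldots, x_k) = \sum_{-\infty < c_1, \ldots, c_k < \infty} a(c_1, \ldots, c_k)\exp\bigl(2\pi i(c_1x_1 + \cdots + c_kx_k)\bigr),$$

we have $\lim_{N\to\infty}\frac1N\sum_{0\le n<N} f(U_n, \ldots, U_{n+k-1}) = a(0, \ldots, 0) + \epsilon_r$, where

$$|\epsilon_r| \le \sum_{\max\{|c_1|, \ldots, |c_k|\} > r} |a(c_1, \ldots, c_k)|,$$

so $\epsilon_r$ can be made arbitrarily small. Hence this limit is equal to

$$a(0, \ldots, 0) = \int_0^1\!\!\cdots\!\int_0^1 f(x_1, \ldots, x_k)\,dx_1 \ldots dx_k,$$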


and Eq. (8) holds for all sufficiently smooth functions f. The remainder of the proof 
shows that the function in (9) can be approximated by smooth functions to any desired 
accuracy. 


23. (a) This follows immediately from exercise 22. (b) Use a discrete Fourier transform 
in an analogous way; see D. E. Knuth, AMM 75 (1968), 260-264. 


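24. (a) Let c be any nonzero integer; we must show, by exercise 22, that

$$\frac1N\sum_{n=0}^{N-1} e^{2\pi icU_n} \to 0 \quad\text{as } N \to \infty.$$

This follows because, if K is any positive integer, we have $\sum_{k=0}^{K-1}\sum_{n=0}^{N-1} e^{2\pi icU_{n+k}} =
K\sum_{n=0}^{N-1} e^{2\pi icU_n} + O(K^2)$. Hence, by Cauchy's inequality,

$$\frac{1}{N^2}\Bigl|\sum_{n=0}^{N-1} e^{2\pi icU_n}\Bigr|^2
= \frac{1}{K^2N^2}\Bigl|\sum_{n=0}^{N-1}\sum_{k=0}^{K-1} e^{2\pi icU_{n+k}}\Bigr|^2 + O\Bigl(\frac KN\Bigr)
\le \frac{1}{K^2N}\sum_{n=0}^{N-1}\Bigl|\sum_{k=0}^{K-1} e^{2\pi icU_{n+k}}\Bigr|^2 + O\Bigl(\frac KN\Bigr)$$

$$= \frac1K + \frac{2}{K^2N}\,\Re\Bigl(\sum_{0\le j<k<K}\ \sum_{n=0}^{N-1} e^{2\pi ic(U_{n+k}-U_{n+j})}\Bigr) + O\Bigl(\frac KN\Bigr) \to \frac1K.$$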

(b) When d = 1, exercise 22 tells us that $\langle(\alpha_1n + \alpha_0) \bmod 1\rangle$ is equidistributed
if and only if $\alpha_1$ is irrational. When d > 1, we can use (a) and induction on d.
[Acta Math. 56 (1931), 373-456. The result in (b) had previously been obtained in a
more complicated way by H. Weyl, Nachr. Gesellschaft der Wiss. Göttingen, Math.-
Phys. Kl. (1914), 234-244. A similar argument proves that the polynomial sequence is
equidistributed if at least one of the coefficients $\alpha_d, \ldots, \alpha_1$ is irrational.]
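
The criterion of exercise 22 lends itself to a quick numerical experiment: for an equidistributed sequence the averaged exponential sums should be small, and for a rational coefficient they need not be. The sketch below is only an illustration in floating point; the function name `mean_exponential` and the sample coefficients are hypothetical choices.

```python
import cmath
import math


def mean_exponential(coeffs, c, count):
    """|(1/N) sum_{n<N} exp(2 pi i c U_n)| for U_n = (polynomial in n) mod 1.

    coeffs lists the polynomial coefficients from the highest degree down.
    """
    total = 0j
    for n in range(count):
        value = 0.0
        for coeff in coeffs:             # Horner evaluation of the polynomial at n
            value = value * n + coeff
        total += cmath.exp(2j * math.pi * c * (value % 1.0))
    return abs(total) / count


print(mean_exponential([math.sqrt(2.0), 0.0], 1, 10000))        # irrational, d = 1: tiny
print(mean_exponential([math.sqrt(2.0), 0.0, 0.0], 1, 10000))   # irrational, d = 2: small
print(mean_exponential([0.5, 0.0], 2, 10000))                   # rational: stays near 1
```

In the linear case the sum is a geometric series, so its absolute value is bounded by $1/|\sin\pi\alpha_1|$ independent of N, and the average decays like 1/N.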

25. If the sequence is equidistributed, the denominator in Corollary S approaches $\frac{1}{12}$,
and the numerator approaches the quantity in this exercise.

26. See Math. Comp. 17 (1963), 50-54. [Consider also the following example by A. G.
Waterman: Let $\langle U_n\rangle$ be an equidistributed [0 .. 1) sequence and $\langle X_n\rangle$ an ∞-distributed
binary sequence. Let $V_n = U_{\lceil\sqrt n\rceil}$ or $1 - U_{\lceil\sqrt n\rceil}$ according as $X_n$ is 0 or 1. Then $\langle V_n\rangle$ is
equidistributed and white, but $\Pr(V_n = V_{n+1}) = \frac12$. Let $W_n = (V_n - \epsilon_n) \bmod 1$ where
$\langle\epsilon_n\rangle$ is any sequence that decreases monotonically to 0; then $\langle W_n\rangle$ is equidistributed
and white, yet $\Pr(W_n < W_{n+1}) = \frac34$.]

28. Let $\langle U_n\rangle$ be ∞-distributed, and consider the sequence $\langle\frac12(X_n + U_n)\rangle$. This is
3-distributed, using the fact that $\langle U_n\rangle$ is (16, 3)-distributed.

29. If $x = x_1x_2\ldots x_t$ is any binary number, we can consider the number $\nu_x^E(n)$ of
times $X_p\ldots X_{p+t-1} = x$, where $1 \le p \le n$ and p is even. Similarly, let $\nu_x^O(n)$ count
the number of times when p is odd. Let $\nu_x^E(n) + \nu_x^O(n) = \nu_x(n)$. Now


vo (n) = XO Vx... (n) ~ Sonal) x XO vko.a(n) ERS XO vla. oln) 


where the ν's in these summations have 2k subscripts, 2k − 1 of which are asterisks
(meaning that they are being summed over; each sum is taken over $2^{2k-1}$ combinations
of zeros and ones), and where “≈” denotes approximate equality (except for an
error of at most 2k due to end conditions). Therefore we find that


aw ekyy (n) = 3 (X vosa (0) +++ + E Varr o(N)) Z2 a(r (2) — s(x))ve'(n) + O (=), 


where $x = x_1\ldots x_{2k}$ contains r(x) zeros in odd positions and s(x) zeros in even positions.
By (2k)-distribution, the parenthesized quantity tends to $k(2^{2k-1})/2^{2k} = k/2$.


The remaining sum is clearly a maximum if $\nu_x^E(n) = \nu_x(n)$ when r(x) > s(x), and
$\nu_x^E(n) = 0$ when r(x) < s(x). So the maximum of the right-hand side becomes


Be Steno) fate S ony Ym 


Now $\overline{\Pr}(X_{2n} = 0) \le \limsup_{n\to\infty}\nu_0^E(2n)/n$, so the proof is complete. Note that we
have

$$\sum_{r,s}\binom nr\binom ns\max(r, s) = 2n2^{2n-2} + n\binom{2n-1}{n},$$

$$\sum_{r,s}\binom nr\binom ns\min(r, s) = 2n2^{2n-2} - n\binom{2n-1}{n}.$$


30. Construct a digraph with $2^{2k}$ nodes labeled $(Ex_1\ldots x_{2k-1})$ and $(Ox_1\ldots x_{2k-1})$,
where each $x_j$ is either 0 or 1. Let there be $1 + f(x_1, x_2, \ldots, x_{2k})$ directed arcs from
$(Ex_1\ldots x_{2k-1})$ to $(Ox_2\ldots x_{2k})$, and $1 - f(x_1, x_2, \ldots, x_{2k})$ directed arcs leading from
$(Ox_1\ldots x_{2k-1})$ to $(Ex_2\ldots x_{2k})$, where $f(x_1, x_2, \ldots, x_{2k}) = \operatorname{sign}(x_1 - x_2 + x_3 - x_4 +
\cdots - x_{2k})$. We find that each node has the same number of arcs leading into it
as there are leading out; for example, $(Ex_1\ldots x_{2k-1})$ has $1 - f(0, x_1, \ldots, x_{2k-1}) +
1 - f(1, x_1, \ldots, x_{2k-1})$ leading in and $1 + f(x_1, \ldots, x_{2k-1}, 0) + 1 + f(x_1, \ldots, x_{2k-1}, 1)$
leading out, and $f(x, x_1, \ldots, x_{2k-1}) = -f(x_1, \ldots, x_{2k-1}, x)$. Drop all nodes that have
no paths leading either in or out, namely $(Ex_1\ldots x_{2k-1})$ if $f(0, x_1, \ldots, x_{2k-1}) = +1$,
Fig. A-5. Directed graph for the construction in exercise 30.


or $(Ox_1\ldots x_{2k-1})$ if $f(1, x_1, \ldots, x_{2k-1}) = -1$. The resulting directed graph is seen to
be connected, since we can get from any node to (E1010...1) and from this to any
desired node. By Theorem 2.3.4.2G, there is a cyclic path traversing each arc; this path
has length $2^{2k+1}$, and we may assume that it starts at node (E00...0). Construct a
cyclic sequence with $X_1 = \cdots = X_{2k-1} = 0$, and $X_{n+2k-1} = x_{2k}$ if the nth arc of the
path is from $(Ex_1\ldots x_{2k-1})$ to $(Ox_2\ldots x_{2k})$ or from $(Ox_1\ldots x_{2k-1})$ to $(Ex_2\ldots x_{2k})$.
For example, the graph for k = 2 is shown in Fig. A-5; the arcs of the cyclic path are
numbered from 1 to 32, and the cyclic sequence is

(00001000110010101001101110111110)(00001...).

Notice that $\Pr(X_{2n} = 0) = \frac{11}{16}$ in this sequence. The sequence is clearly (2k)-distributed,
since each (2k)-tuple $x_1x_2\ldots x_{2k}$ occurs

$$1 + f(x_1, \ldots, x_{2k}) + 1 - f(x_1, \ldots, x_{2k}) = 2$$

times in the cycle. The fact that $\Pr(X_{2n} = 0)$ has the desired value comes from the fact
that the maximum value on the right-hand side in the proof of the preceding exercise
has been achieved by this construction.
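
Both claims about the k = 2 example are easy to confirm by direct counting. The following Python sketch (the helper name `window_counts` is hypothetical) tallies every cyclic 4-tuple of the 32-bit cycle and counts the zeros in even positions, with positions numbered from 1 as in the text.

```python
from collections import Counter

CYCLE = "00001000110010101001101110111110"


def window_counts(cycle, width):
    """Count each cyclic window of the given width in one period of the cycle."""
    doubled = cycle + cycle[:width - 1]          # wrap around the end of the period
    return Counter(doubled[i:i + width] for i in range(len(cycle)))


counts = window_counts(CYCLE, 4)
print(len(counts), set(counts.values()))         # 16 distinct 4-tuples, each twice
even_zeros = CYCLE[1::2].count("0")              # positions 2, 4, ..., 32
print(even_zeros, "of", len(CYCLE) // 2)         # 11 of 16
```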


31. Use Algorithm W with rule $R_1$ selecting the entire sequence. [For a generalization
of this type of nonrandom behavior in R5-sequences, see Jean Ville, Étude Critique de
la Notion de Collectif (Paris: 1939), 55-62. Perhaps R6 is also too weak, from this
standpoint, but no such counterexample is presently known.]


32. If R, R′ are computable subsequence rules, so is R″ = RR′ defined by the following
functions: $f''_n(x_0, \ldots, x_{n-1}) = 1$ if and only if R defines the subsequence $x_{r_1}, \ldots, x_{r_k}$
of $x_0, \ldots, x_{n-1}$, where $k \ge 0$ and $0 \le r_1 < \cdots < r_k < n$ and $f'_k(x_{r_1}, \ldots, x_{r_k}) = 1$.
Now $\langle X_n\rangle RR'$ is $(\langle X_n\rangle R)R'$. The result follows immediately.


33. Given $\epsilon > 0$, find $N_0$ such that $N > N_0$ implies that both $|\nu_r(N)/N - p| < \epsilon$ and
$|\nu_s(N)/N - p| < \epsilon$. Then find $N_1$ such that $N > N_1$ implies that $t_N$ is $r_M$ or $s_M$ for
some $M > N_0$. Now $N > N_1$ implies that
%(N Vr N, + Vs Ns 
E 


N N 


34. For example, if the binary representation of $t$ is $(1\,0^{b-2}\,1\,0^{a_1}\,1\,1\,0^{a_2}\,1 \ldots 1\,0^{a_k})_2$,
where “$0^a$” stands for a sequence of $a$ consecutive zeros, let the rule $\mathcal{R}_t$ accept $U_n$ if
and only if $\lfloor bU_{n-k}\rfloor = a_1$, ..., $\lfloor bU_{n-1}\rfloor = a_k$.


35. Let $a_0 = s_0$ and $a_{m+1} = \max\{s_k \mid 0 \le k < 2^{a_m}\}$. Construct a subsequence rule
that selects element $X_n$ if and only if $n = s_k$ for some $k < 2^{a_m}$, when $n$ is in the range
$a_m < n < a_{m+1}$. Then $\lim_{m\to\infty} \nu(a_m)/a_m = \frac12$.

36. Let $b$ and $k$ be arbitrary but fixed integers greater than 1. Let $Y_n = \lfloor bU_n \rfloor$. An
arbitrary infinite subsequence $(Z_n) = (Y_{s_n})\mathcal{R}$ determined by algorithms $S$ and $R$ (as
in the proof of Theorem M) corresponds in a straightforward but notationally hopeless
manner to algorithms $S'$ and $R'$ that inspect $X_t$, $X_{t+1}$, ..., $X_{t+s}$ and/or select $X_t$,
$X_{t+1}$, ..., $X_{t+\min(k-1,s)}$ of $(X_n)$ if and only if $S$ and $R$ inspect and/or select $Y_s$, where
$U_s = (0.X_t X_{t+1} \ldots X_{t+s})_2$. Algorithms $S'$ and $R'$ determine an infinite 1-distributed
subsequence of $(X_n)$ and in fact (as in exercise 32) this subsequence is $\infty$-distributed
so it is $(k, 1)$-distributed. Hence we find that $\underline{\Pr}(Z_n = a)$ and $\overline{\Pr}(Z_n = a)$ differ from
$1/b$ by less than $1/2^k$.

[The result of this exercise is true if “R6” is replaced consistently by “R4” or “R5”;
but it is false if “R1” is used, since $X_{\binom{n}{2}}$ might be identically zero.]

37. For $n \ge 2$ replace $U_{n^2}$ by $\frac12(U_{n^2} + \delta_n)$, where $\delta_n = 0$ or 1 according as the
set $\{U_{(n-1)^2+1}, \ldots, U_{n^2-1}\}$ contains an even or odd number of elements less than $\frac12$.
[Advances in Math. 14 (1974), 333-334; see also the Ph.D. thesis of Thomas N. Herzog,
Univ. of Maryland (1975).]

39. See Acta Arithmetica 21 (1972), 45-50. The best possible value of c is unknown. 


40. Since Fk depends only on Bı... Bk, we have P(A}, $n) = 3. Let q(Bi... Br) = 
Pr(Bk+1 = 1 | Bi... Bx), where the probability is taken over all elements of S having 
Bı... By as the first k bits. Similarly, let qo(Bi... Be) = Pr(Fk = 1 and Bras =b | 
Bı... By). Then we have Pr(A§ = 1 | Bi... Be) = Pr((Fe+Bp}1+Bk41) mod 2 = 1 | 
By... Be) =q (4—0 +q) +(1—q): (q0o+ 1-41) = 4 — (qo +41) +2(gqit (1 — q)q0) = 
3 — Pr(Fk = 1 | Bi... Be) + 2Pr(Fk = 1 and Bhi, = Bryi | Bi... By). Hence 
Pr(Ay =1) = Dp, p, Pr(Bi--. Be) Pr(Ak = 1 | Bi... Bk) = 3 — Pr(F. = 1) 4 
Pr(Fk+1 = 1). [See Theorem 4 of Goldreich, Goldwasser, and Micali in JACM 33 
(1986), 792-807.] 


41. Choose $k$ uniformly from $\{0, \ldots, N-1\}$ and use the construction in the proof of
Lemma P1. Then the proof of P1 shows that $A'$ will be equal to 1 with probability
$\sum_{k=0}^{N-1} \bigl(\frac12 - p_k + p_{k+1}\bigr)/N$.


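42. (a) Let $X = X_1 + \cdots + X_n$. Clearly $\mathrm{E}(X) = n\mu$; and we have
$\mathrm{E}((X - n\mu)^2) = \mathrm{E}X^2 - n^2\mu^2 = n\,\mathrm{E}X_1^2 + 2\sum_{1\le i<j\le n}(\mathrm{E}X_i)(\mathrm{E}X_j) - n^2\mu^2 = n\,\mathrm{E}X_1^2 - n\mu^2 = n\sigma^2$.
Also $\mathrm{E}((X - n\mu)^2) = \sum_{x\ge0} x \Pr((X - n\mu)^2 = x) \ge \sum_{x\ge tn\sigma^2} x \Pr((X - n\mu)^2 = x) \ge
\sum_{x\ge tn\sigma^2} tn\sigma^2 \Pr((X - n\mu)^2 = x) = tn\sigma^2 \Pr((X - n\mu)^2 \ge tn\sigma^2)$.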

(b) There is a position $i$ where $c_i \ne c'_i$, say $c_i = 0$ and $c'_i = 1$. Then there's a
position $j$ where $c_j = 1$. For any fixed setting of $B$ in the $k - 2$ rows other than $i$ or $j$,
we have $(cB, c'B) = (d, d')$ if and only if rows $i$ and $j$ have particular values; this occurs
with probability $1/2^{2R}$.


(c) In the notation of Algorithm L, take $n = 2^k - 1$ and $X_c = (-1)^{G(cB + e_i) + x\cdot(cB + e_i)}$; then
$\mu = s$ and $\sigma^2 = 1 - s^2$. The probability that $X = \sum_c X_c$ is negative is at most the
probability that $(X - n\mu)^2 \ge n^2\mu^2$. By (a) this is at most $\sigma^2/(n\mu^2)$.


43. The conclusion for fixed M would be of no interest, since there obviously exists an 
algorithm to factor any fixed M (namely, an algorithm that knows the factors). The 
theory applies to all algorithms that have short running time, not only to algorithms 
that are effectively discoverable. 


44. If every one-digit change to a random table yields a random table, all tables are 
random (or none are). If we don’t allow degrees of randomness, the answer must 
therefore be, “Not always.” 


SECTION 3.6 
1.  RANDI  STJ   9F          Store exit location.
           STA   8F          Store value of k.
           LDA   XRAND       rA ← X.
           MUL   7F          rAX ← aX.
           INCX  1009        rX ← (aX + c) mod m.
           JOV   *+1         Ensure that overflow is off.
           SLAX  5           rA ← (aX + c) mod m.
           STA   XRAND       Store X.
           MUL   8F          rA ← ⌊kX/m⌋.
           INCA  1           Add 1, so that 1 ≤ Y ≤ k.
    9H     JMP   *           Return.
    XRAND  CON   1           Value of X; X_0 = 1.
    8H     CON   0           Temp storage of k.
    7H     CON   3141592621  The multiplier a.
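
For readers who do not speak MIX, the same idea can be sketched in C. This is only an illustration under a loudly stated assumption: the modulus is taken to be $m = 2^{32}$ (so that unsigned wraparound does the reduction), which is not the MIX word size. The multiplier and the increment 1009 are copied from the program above; the names xrand and randi are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t xrand = 1;          /* X, with X_0 = 1 as in the MIX program */

/* Return a pseudo-random integer Y with 1 <= Y <= k.
   Assumption: m = 2^32, which differs from the MIX word size. */
uint32_t randi(uint32_t k) {
    const uint32_t a = 3141592621u; /* multiplier from the MIX program */
    const uint32_t c = 1009u;       /* increment from the MIX program (INCX 1009) */
    assert(k >= 1);
    xrand = a * xrand + c;          /* unsigned wraparound gives (aX + c) mod 2^32 */
    return (uint32_t)(((uint64_t)k * xrand) >> 32) + 1;   /* floor(kX/m) + 1 */
}
```

As in the MIX version, the result comes from the high-order part of the product $kX$, not from $X \bmod k$, so it depends on the most significant bits of $X$.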


2. Putting a random number generator into a program makes the results essentially 
unpredictable to the programmer. If the behavior of the machine on each problem were 
known in advance, few programs would ever be written. As Turing has said, the actions 
of a computer quite often do surprise its programmer, especially when a program is 
being debugged. 

So the world had better watch out. 


7. In fact, you only need the 2-bit values $\lfloor X_n/2^{16}\rfloor \bmod 4$; see D. E. Knuth, IEEE
Trans. IT-31 (1985), 49-52. J. Reeds, Cryptologia 1 (1977), 20-26, 3 (1979), 83-95, 
initiated the study of related problems; see also J. Boyar, J. Cryptology 1 (1989), 177- 
184. In SICOMP 17 (1988), 262-280, Frieze, Hastad, Kannan, Lagarias, and Shamir 
discuss general techniques that are useful in problems like this. 


8. We can, say, generate $X_{1000000}$ by making one million successive calls, and compare
it to the correct value $(a^{1000000} X_0 + (a^{1000000} - 1)c/(a - 1)) \bmod m$, which can also
be expressed as $((a^{1000000}(X_0(a - 1) + c) - c) \bmod (a - 1)m)/(a - 1)$. The latter can
be evaluated quickly by an independent method (see Algorithm 4.6.3A). For example,
$48271^{1000000} \bmod 2147483647 = 1263606197$. Most errors will be detected, because
recurrence (1) is not self-correcting.
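
The comparison described here can be sketched in C as follows. The sketch uses ordinary binary exponentiation as the "independent method" (an assumption; the text points to Algorithm 4.6.3A without giving code), and the function names pow_mod and lcg_iterate are illustrative. Both routines assume $m < 2^{32}$ so that every product fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* (base^exp) mod m by repeated squaring; requires m < 2^32. */
uint64_t pow_mod(uint64_t base, uint64_t exp, uint64_t m) {
    uint64_t result;
    assert(m > 0 && m <= 0xFFFFFFFFu);
    result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = result * base % m;
        base = base * base % m;
        exp >>= 1;
    }
    return result;
}

/* X_n for X_{j+1} = (a X_j + c) mod m, by n successive steps (the slow way). */
uint64_t lcg_iterate(uint64_t a, uint64_t c, uint64_t m, uint64_t x0, uint64_t n) {
    uint64_t x;
    assert(m > 0 && m <= 0xFFFFFFFFu);
    a %= m; c %= m; x = x0 % m;
    for (uint64_t j = 0; j < n; j++) x = (a * x + c) % m;
    return x;
}
```

With $c = 0$ and $X_0 = 1$, lcg_iterate(48271, 0, 2147483647, 1, 1000000) and pow_mod(48271, 1000000, 2147483647) should agree with each other, and the common value can be compared with the figure quoted in the answer.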


9. (a) The values of $X_0$, $X_1$, ..., $X_{99}$ are not all even. The polynomial $z^{100} + z^{37} + 1$
is primitive (see Section 3.2.2); hence there is a number $h(s)$ such that $P_0(z) \equiv z^{h(s)}$
(modulo 2 and $z^{100} + z^{37} + 1$). Now $zP_{n+1}(z) = P_n(z) - X_n z^{37} - X_{n+63} + X_{n+63} z^{100} +
X_{n+100} z^{37} \equiv P_n(z) + X_{n+63}(z^{100} + z^{37} + 1)$ (modulo 2), so the result holds by induction.

(b) The operations “square” and “multiply by z” in ran_start change $p(z) =
x_{99}z^{99} + \cdots + x_1 z + x_0$ to $p(z)^2$ and $zp(z)$, respectively, modulo 2 and $z^{100} + z^{37} + 1$,
because $p(z)^2 \equiv p(z^2)$. (We consider here only the low-order bits. The other bits are
manipulated in an ad hoc way that tends to preserve and/or enhance whatever disorder
they already have.) Therefore if $s = (1 s_j \ldots s_1 s_0)_2$ we have $h(s) = (1 s_0 s_1 \ldots s_j 1)_2 \cdot 2^{69}$.

(c) $z^{h(s)-n} \equiv z^{h(s')-n'}$ (modulo 2 and $z^{100} + z^{37} + 1$) implies that $h(s) - n \equiv
h(s') - n'$ (modulo $2^{100} - 1$). Since $2^{69} \le h(s) < 2^{100} - 2^{69}$, we have $|n - n'| \ge
|h(s) - h(s')| \ge 2^{70}$.

[This method of initialization was inspired by comments of R. P. Brent, Proc.
Australian Supercomputer Conf. 5 (1992), 95-104, although Brent’s algorithm was
completely different. In general if the lags are $k > l$, if $0 \le s < 2^e$, and if the separation
parameter $t$ satisfies $t + e \le k$, this method of proof shows that $|n - n'| \ge 2^t - 1$, with
$2^t - 1$ occurring only if $\{s, s'\} = \{0, 2^e - 1\}$.]


10. The following code belongs to the simplified language Subset FORTRAN, as de- 
fined by the American National Standards Institute, except for its use of PARAMETER 
statements for readability. 


      SUBROUTINE RNARRY(AA,N)
      IMPLICIT INTEGER (A-Z)
      DIMENSION AA(*)
      PARAMETER (KK=100)
      PARAMETER (LL=37)
      PARAMETER (MM=2**30)
      COMMON /RSTATE/ RANX(KK)
      SAVE /RSTATE/
      DO 1 J=1,KK
 1       AA(J)=RANX(J)
      DO 2 J=KK+1,N
         AA(J)=AA(J-KK)-AA(J-LL)
         IF (AA(J) .LT. 0) AA(J)=AA(J)+MM
 2    CONTINUE
      DO 3 J=1,LL
         RANX(J)=AA(N+J-KK)-AA(N+J-LL)
         IF (RANX(J) .LT. 0) RANX(J)=RANX(J)+MM
 3    CONTINUE
      DO 4 J=LL+1,KK
         RANX(J)=AA(N+J-KK)-RANX(J-LL)
         IF (RANX(J) .LT. 0) RANX(J)=RANX(J)+MM
 4    CONTINUE
      END

      SUBROUTINE RNSTRT(SEED)
      IMPLICIT INTEGER (A-Z)
      PARAMETER (KK=100)
      PARAMETER (LL=37)
      PARAMETER (MM=2**30)
      PARAMETER (TT=70)
      PARAMETER (KKK=KK+KK-1)
      DIMENSION X(KKK)
      COMMON /RSTATE/ RANX(KK)
      SAVE /RSTATE/
      IF (SEED .LT. 0) THEN
         SSEED=MM-1-MOD(-1-SEED,MM)
      ELSE
         SSEED=MOD(SEED,MM)
      END IF
      SS=SSEED-MOD(SSEED,2)+2
      DO 1 J=1,KK
         X(J)=SS
         SS=SS+SS
         IF (SS .GE. MM) SS=SS-MM+2
 1    CONTINUE
      X(2)=X(2)+1
      SS=SSEED
      T=TT-1
 10   DO 12 J=KK,2,-1
         X(J+J-1)=X(J)
 12      X(J+J-2)=0
      DO 14 J=KKK,KK+1,-1
         X(J-(KK-LL))=X(J-(KK-LL))-X(J)
         IF (X(J-(KK-LL)) .LT. 0) X(J-(KK-LL))=X(J-(KK-LL))+MM
         X(J-KK)=X(J-KK)-X(J)
         IF (X(J-KK) .LT. 0) X(J-KK)=X(J-KK)+MM
 14   CONTINUE
      IF (MOD(SS,2) .EQ. 1) THEN
         DO 16 J=KK,1,-1
 16         X(J+1)=X(J)
         X(1)=X(KK+1)
         X(LL+1)=X(LL+1)-X(KK+1)
         IF (X(LL+1) .LT. 0) X(LL+1)=X(LL+1)+MM
      END IF
      IF (SS .NE. 0) THEN
         SS=SS/2
      ELSE
         T=T-1
      END IF
      IF (T .GT. 0) GO TO 10
      DO 20 J=1,LL
 20      RANX(J+KK-LL)=X(J)
      DO 21 J=LL+1,KK
 21      RANX(J-LL)=X(J)
      DO 22 J=1,10
 22      CALL RNARRY(X,KKK)
      END


11. Floating point arithmetic on 64-bit operands conforming to ANSI/IEEE Standard
754 allows us to compute $U_n = (U_{n-100} - U_{n-37}) \bmod 1$ with perfect accuracy for
fractions $U_n$ that are integer multiples of $2^{-53}$. However, the following program uses
the additive recurrence $U_n = (U_{n-100} + U_{n-37}) \bmod 1$ on integer multiples of $2^{-52}$
instead, because pipelined computers can subtract an integer part more quickly than
they can branch conditionally on the sign of an intermediate result. The theory of
exercise 9 applies equally well to this sequence.

A FORTRAN translation similar to the code in exercise 10 will generate exactly
the same numbers as this C routine.


#include <stdio.h>                 /* for printf in the test routine */
#define KK 100                     /* the long lag */
#define LL 37                      /* the short lag */
#define mod_sum(x,y) (((x)+(y))-(int)((x)+(y)))   /* (x+y) mod 1.0 */
double ran_u[KK];                  /* the generator state */

void ranf_array(double aa[],int n) {   /* aa gets n random fractions */
  register int i,j;
  for (j=0;j<KK;j++) aa[j]=ran_u[j];
  for (;j<n;j++) aa[j]=mod_sum(aa[j-KK],aa[j-LL]);
  for (i=0;i<LL;i++,j++) ran_u[i]=mod_sum(aa[j-KK],aa[j-LL]);
  for (;i<KK;i++,j++) ran_u[i]=mod_sum(aa[j-KK],ran_u[i-LL]);
}

#define TT 70                      /* guaranteed separation between streams */
#define is_odd(s) ((s)&1)

void ranf_start(long seed) {       /* do this before using ranf_array */
  register int t,s,j;
  double u[KK+KK-1];
  double ulp=(1.0/(1L<<30))/(1L<<22);          /* 2 to the -52 */
  double ss=2.0*ulp*((seed&0x3fffffff)+2);

  for (j=0;j<KK;j++) {
    u[j]=ss;                       /* bootstrap the buffer */
    ss+=ss;
    if (ss>=1.0) ss-=1.0-2*ulp;    /* cyclic shift of 51 bits */
  }
  u[1]+=ulp;                       /* make u[1] (and only u[1]) "odd" */
  for (s=seed&0x3fffffff,t=TT-1; t; ) {
    for (j=KK-1;j>0;j--)
      u[j+j]=u[j],u[j+j-1]=0.0;    /* "square" */
    for (j=KK+KK-2;j>=KK;j--) {
      u[j-(KK-LL)]=mod_sum(u[j-(KK-LL)],u[j]);
      u[j-KK]=mod_sum(u[j-KK],u[j]);
    }
    if (is_odd(s)) {               /* "multiply by z" */
      for (j=KK;j>0;j--) u[j]=u[j-1];
      u[0]=u[KK];                  /* shift the buffer cyclically */
      u[LL]=mod_sum(u[LL],u[KK]);
    }
    if (s) s>>=1; else t--;
  }
  for (j=0;j<LL;j++) ran_u[j+KK-LL]=u[j];
  for (;j<KK;j++) ran_u[j-LL]=u[j];
  for (j=0;j<10;j++) ranf_array(u,KK+KK-1);   /* warm everything up */
}

int main() {                       /* a rudimentary test */
  register int m;
  double a[2009];
  ranf_start(310952);
  for (m=0;m<2009;m++)
    ranf_array(a,1009);
  printf("%.20f\n", ran_u[0]);     /* 0.36410514377569680455 */
  ranf_start(310952);
  for (m=0;m<1009;m++)
    ranf_array(a,2009);
  printf("%.20f\n", ran_u[0]);     /* 0.36410514377569680455 */
  return 0;
}


12. A simple linear congruential generator like (1) would fail, because m would be much 
too small. Good results are possible by combining three (not two) such generators, 
with multipliers and moduli (157, 32363), (146,31727), (142,31657), as suggested by 
P. L’Ecuyer in CACM 31 (1988), 747-748. However, the best method is probably to 
use the C programs ran_array and ran_start, with the following changes to keep all
numbers in range: ‘long’ becomes ‘int’; ‘MM’ is defined to be ‘(1U<<15)’; and the type
of variable ss should be unsigned int. This generates 15-bit integers, all of whose bits
are usable. The seed is now restricted to the range [0..32765]. The “rudimentary test
routine” will print $X_{1009\times2009} = 24130$, given the seed 12509.


13. A program for subtract-with-borrow would be very similar to ran_array, but slower
because of the carry maintenance. As in exercise 11, floating point arithmetic could
be used with perfect accuracy. It is possible to guarantee disjointness of the sequences
produced from different seeds $s$ by initializing the generator with the $(-n)$th element
of the sequence, where $n = 2^{70}s$; this requires computing $b^n \bmod (b^k - b^l + 1)$. Squaring
a radix-$b$ number mod $b^k - b^l + 1$ is, however, considerably more complicated than the
analogous operation in program ran_start, and for $k$ in a practical range it takes about
$k^{1.6}$ operations instead of $O(k)$.

Both methods probably generate sequences of the same quality in practice, when 
they have roughly the same value of k. The only significant difference between them 
is a better theoretical guarantee and a provably immense period for the subtract-with- 
borrow method; the analysis of lagged Fibonacci generators is less complete. Experience 
shows that we should not reduce the value of k in subtract-with-borrow just because of 
these theoretical advantages. When all is said and done, lagged Fibonacci generators 
seem preferable from a practical standpoint; the subtract-with-borrow method is then 
valuable chiefly because of the insight it gives us into the excellent behavior of the 
simpler approach. 


14. We have $X_{n+200} \equiv (X_n + X_{n+126})$ (modulo 2); see exercise 3.2.2-32. Hence
$Y_{n+100} \equiv Y_n + Y_{n+26}$ when $n \bmod 100 > 73$. Similarly $X_{n+200} \equiv X_n + X_{n+26} + X_{n+89}$;
hence $Y_{n+100} \equiv Y_n + Y_{n+26} + Y_{n+89}$ when $n \bmod 100 < 11$. Thus $Y_{n+100}$ is a sum of only
two or three elements of $\{Y_n, \ldots, Y_{n+99}\}$, in 26% + 11% of all cases; a preponderance
of 0s will then tend to make $Y_{n+100} = 0$.

More precisely, consider the sequence $(u_1, u_2, \ldots) = (126, 89, 152, 115, 78, \ldots, 100,
63, 126, \ldots)$ where $u_{n+1} = u_n - 37 + 100[u_n < 100]$. Then we have

$$X_{n+200} = (X_n + X_{n+v_1} + \cdots + X_{n+v_{k-2}} + X_{n+u_{k-1}}) \bmod 2,$$

where $v_j = u_j + (-1)^{[u_j \ge 100]}\,100$; for example, $X_{n+200} \equiv X_n + X_{n+26} + X_{n+189} +
X_{n+152} \equiv X_n + X_{n+26} + X_{n+189} + X_{n+52} + X_{n+115}$. If the subscripts are all $< n + t$
or $\ge n + 100 + t$, we obtain a $k$-term expression for $Y_{n+100}$ when $n \bmod 100 = 100 - t$,
for $1 \le t \le 100$. The case $t = 63$ is an exception, because $X_n + X_{n+1} + \cdots +
X_{n+62} + X_{n+163} + X_{n+164} + \cdots + X_{n+199} \equiv 0$; in this case $Y_{n+100}$ is independent of
$\{Y_n, \ldots, Y_{n+99}\}$. The case $t = 64$ is interesting because it gives the 99-term relation
$Y_{n+100} \equiv Y_{n+1} + Y_{n+2} + \cdots + Y_{n+99}$; this tends to be 0 in spite of the large number of
terms, because most of the 100-tuples that have 40 or fewer 1s have even parity.

When there is a $k$-term relation, the probability that $Y_{n+100} = 1$ is

$$p_k = \sum_{l=0}^{40} \sum_{j\ \mathrm{odd}} \binom{k}{j}\binom{100-k}{l-j} \Bigg/ \sum_{l=0}^{40} \binom{100}{l}.$$

The quantity $t$ takes the values 100, 99, ..., 1, 100, 99, ..., 1, ... as bits are printed;
so we find that the expected number of 1s printed is $10^6(26p_2 + 11p_3 + 26p_4 + 11p_6 +
11p_9 + 4p_{12} + 4p_{20} + 3p_{28} + p_{47} + p_{74} + p_{99} + 1/2)/100 \approx 14043$. The expected number of
digits printed is $10^6 \sum_{k=0}^{40} \binom{100}{k}/2^{100} \approx 28444$, so the expected number of 0s is $\approx 14401$.

The detectable bias goes away if more elements are discarded. For example, if we
use only 100 elements of ran_array(a,300), the probability can be shown to be $(26p_5 +
22p_8 + 19p_{10} + \cdots)/100$; with ran_array(a,400) it is worse, $(15p_3 + 37p_6 + 15p_9 + \cdots)/100$,
because $X_{n+400} \equiv X_n + X_{n+252}$. With ran_array(a,1009) as recommended in the text
we have $(17p_7 + 10p_{11} + 2p_{12} + \cdots)/100$, which can only be detected by such experiments
if the threshold for printing is raised from 60 to, say, 75; but then the expected number
of outputs is only about 0.28 per million trials.

[This exercise is based on ideas of Y. Kurita, H. Leeb, and M. Matsumoto, com- 
municated to the author in 1997.] 


15. The following program makes it possible to obtain a new random integer quickly 
with the expression ran_arr_next(), once ran_start has been called to get things started: 


#define QUALITY 1009   /* recommended quality level for high-res use */
#define KK 100         /* the long lag */
long ran_arr_buf[QUALITY];
long ran_arr_sentinel=-1;
long *ran_arr_ptr=&ran_arr_sentinel;   /* the next random number, or -1 */
#define ran_arr_next() (*ran_arr_ptr>=0? *ran_arr_ptr++: ran_arr_cycle())
long ran_arr_cycle()
{
  ran_array(ran_arr_buf,QUALITY);
  ran_arr_buf[KK]=-1; ran_arr_ptr=ran_arr_buf+1;
  return ran_arr_buf[0];
}


Reset ran_arr_ptr = &ran_arr_sentinel if ran_start is used again. 


SECTION 4.1 
1. $(1010)_{-2}$, $(1011)_{-2}$, $(1000)_{-2}$, ..., $(11000)_{-2}$, $(11001)_{-2}$, $(11110)_{-2}$.

2. (a) $-(110001)_2$, $-(11.001001001001\ldots)_2$, $(11.00100100001111110110101\ldots)_2$.
(b) $(11010011)_{-2}$, $(1101.001011001011\ldots)_{-2}$, $(111.0110010001000000101\ldots)_{-2}$.
(c) $(\bar1 1 1 \bar1 \bar1)_3$, $(\bar1 0.0\bar1\bar1 011 0\bar1\bar1 011\ldots)_3$, $(10.011\bar1 111\bar1 000\bar1 011\bar1 1101\bar1 1111110\ldots)_3$.
(d) $-(9.4)_{1/10}$, $-(\ldots7582417582413)_{1/10}$, $(\ldots3462648323979853562951413)_{1/10}$.
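
The negabinary entries in answers 1 and 2(b) can be reproduced by repeated division by $-2$ with the remainder forced into $\{0, 1\}$. The following C sketch shows one way to do this; the function name to_negabinary and its buffer interface are illustrative assumptions, not anything defined in the text.

```c
#include <assert.h>
#include <string.h>

/* Write the radix -2 digits of n into buf (most significant first); returns buf. */
char *to_negabinary(long n, char *buf, size_t buflen) {
    char tmp[80];
    size_t len = 0;
    assert(buflen >= 2);
    if (n == 0) tmp[len++] = '0';
    while (n != 0) {
        long r = n % 2;            /* C remainder: -1, 0, or 1 */
        if (r < 0) r += 2;         /* force the digit into {0, 1} */
        assert(len < sizeof tmp);
        tmp[len++] = (char)('0' + r);
        n = (n - r) / -2;          /* exact division by the radix */
    }
    assert(len < buflen);
    for (size_t i = 0; i < len; i++) buf[i] = tmp[len - 1 - i];
    buf[len] = '\0';
    return buf;
}
```

For instance, $-10$, $10$, and $-49$ give the digit strings 1010, 11110, and 11010011, matching the answers above.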


3. $(1010113.2)_{2i}$.


4. (a) Between rA and rX. (b) The remainder in rX has radix point between bytes 
3 and 4; the quotient in rA has radix point one byte to the right of the least significant 
portion of the register. 


5. It has been subtracted from $999\ldots9 = 10^p - 1$, instead of from $1000\ldots0 = 10^p$.

6. (a, c) $2^{p-1} - 1$, $-(2^{p-1} - 1)$; (b) $2^{p-1} - 1$, $-2^{p-1}$.


7. A ten’s complement representation for a negative number $x$ can be obtained by
considering $10^n + x$ (where $n$ is large enough for this to be positive) and extending it
on the left with infinitely many nines. The nines’ complement representation can be 
obtained in the usual manner. (These two representations are equal for nonterminating 
decimals, otherwise the nines’ complement representation has the form ...(a)99999... 
while the ten’s complement representation has the form ...(a + 1)0000....) The 
representations may be considered sensible if we regard the value of the infinite sum 
N = 9 + 90 + 900 + 9000 + --- as —1, since N — 10N = 9. 

See also exercise 31, which considers p-adic number systems. The latter agree with 
the p’s complement notations considered here, for numbers whose radix-p representation 
is terminating, but there is no simple relation between the field of p-adic numbers and 
the field of real numbers. 


8. $\sum_j a_j b^j = \sum_j (a_{kj+k-1} b^{k-1} + \cdots + a_{kj})\, b^{kj}$.


9. A BAD ADOBE FACADE FADED. [Note: Other possible “number sentences” would be 
DO A DEED A DECADE; A CAD FED A BABE BEEF, COCOA, COFFEE; BOB FACED A DEAD DODO.] 


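10. $$\begin{bmatrix} \ldots, & a_3, & a_2, & a_1, & a_0; & a_{-1}, & a_{-2}, & \ldots \\ \ldots, & b_3, & b_2, & b_1, & b_0; & b_{-1}, & b_{-2}, & \ldots \end{bmatrix} = \begin{bmatrix} \ldots, & A_3, & A_2, & A_1, & A_0; & A_{-1}, & A_{-2}, & \ldots \\ \ldots, & B_3, & B_2, & B_1, & B_0; & B_{-1}, & B_{-2}, & \ldots \end{bmatrix}$$ if

$$A_j = \begin{bmatrix} a_{k_{j+1}-1}, & a_{k_{j+1}-2}, & \ldots, & a_{k_j} \\ & b_{k_{j+1}-2}, & \ldots, & b_{k_j} \end{bmatrix}, \qquad B_j = b_{k_{j+1}-1} \cdots b_{k_j},$$

where $(k_n)$ is any doubly infinite sequence of integers with $k_{j+1} > k_j$ and $k_0 = 0$.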


11. (The following algorithm works both for addition or subtraction, depending on
whether the plus or minus sign is chosen.)

Start by setting $k \leftarrow a_{n+1} \leftarrow a_{n+2} \leftarrow b_{n+1} \leftarrow b_{n+2} \leftarrow 0$; then for $m = 0, 1,
\ldots, n + 2$ do the following: Set $c_m \leftarrow a_m \pm b_m + k$; then if $c_m \ge 2$, set $k \leftarrow -1$ and
$c_m \leftarrow c_m - 2$; otherwise if $c_m < 0$, set $k \leftarrow 1$ and $c_m \leftarrow c_m + 2$; otherwise (namely if
$0 \le c_m \le 1$), set $k \leftarrow 0$.
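
The digit-by-digit rule above translates almost literally into C. In this sketch the digits are stored least significant first, and the names negabinary_addsub and negabinary_value are hypothetical helpers introduced for illustration; the second one exists only to check results.

```c
#include <assert.h>

/* c = a + sign*b in radix -2, where sign is +1 or -1.
   a and b hold digits a[0..n], b[0..n] (least significant first);
   c must have room for n+3 digits c[0..n+2]. */
void negabinary_addsub(const int *a, const int *b, int n, int sign, int *c) {
    int k = 0;                                   /* carry: -1, 0, or +1 */
    assert(sign == 1 || sign == -1);
    for (int m = 0; m <= n + 2; m++) {
        int am = (m <= n) ? a[m] : 0;            /* a_{n+1} = a_{n+2} = 0 */
        int bm = (m <= n) ? b[m] : 0;            /* b_{n+1} = b_{n+2} = 0 */
        int cm = am + sign * bm + k;
        if (cm >= 2)     { k = -1; cm -= 2; }
        else if (cm < 0) { k = 1;  cm += 2; }
        else             { k = 0; }
        c[m] = cm;
    }
    assert(k == 0);                              /* two extra digits suffice */
}

/* Value of the radix -2 number d[0..len-1] (least significant first). */
long negabinary_value(const int *d, int len) {
    long v = 0;
    for (int m = len - 1; m >= 0; m--) v = -2 * v + d[m];
    return v;
}
```

Note that the carry has the opposite sign from the one used in ordinary binary addition, because neighboring digit positions have weights of opposite sign.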


12. (a) Subtract $\pm(\ldots a_3 0 a_1 0)_{-2}$ from $\pm(\ldots a_4 0 a_2 0 a_0)_{-2}$ in the negabinary system.
(See also exercise 7.1.3-7 for a trickier solution that uses full-word bitwise operations.)
(b) Subtract $(\ldots b_3 0 b_1 0)_2$ from $(\ldots b_4 0 b_2 0 b_0)_2$ in the binary system.


13. $(1.909090\ldots)_{-10} = (0.090909\ldots)_{-10} = \frac{1}{11}$.

14.         1 1 3 2 1   [5 - 4i]
            1 1 3 2 1   [5 - 4i]
            ---------
            1 1 3 2 1
          1 1 2 0 2
        1 2 1 2 3
      1 1 3 2 1
    1 1 3 2 1
    -----------------
    0 1 0 3 1 1 2 0 1   [9 - 40i]

15. $[-\frac{10}{11} \,..\, \frac{1}{11}]$, and the rectangle on the right.

Fig. A-6. Fundamental region for quater-imaginary numbers (a rectangle with corners
$-\frac45 - \frac85 i$, $\frac15 - \frac85 i$, $\frac15 + \frac25 i$, and $-\frac45 + \frac25 i$).

16. It is tempting to try to do this in a very simple way, by using the rule $2 = (1100)_{i-1}$
to take care of carries; but that leads to a nonterminating method if, for example, we
try to add 1 to $(11101)_{i-1} = -1$.

The following solution does the job by providing four related algorithms (namely
for adding or subtracting 1 or $i$). If $\alpha$ is a string of zeros and ones, let $\alpha^P$ be a string
of zeros and ones such that $(\alpha^P)_{i-1} = (\alpha)_{i-1} + 1$; and let $\alpha^{-P}$, $\alpha^Q$, $\alpha^{-Q}$ be defined
similarly, with $-1$, $+i$, and $-i$ respectively in place of $+1$. Then

    $(\alpha 0)^P = \alpha 1$;               $(\alpha x 1)^P = \alpha^Q x 0$.       $(\alpha 0)^Q = \alpha^P 1$;       $(\alpha 1)^Q = \alpha^{-Q} 0$.
    $(\alpha x 0)^{-P} = \alpha^{-Q} x 1$;   $(\alpha 1)^{-P} = \alpha 0$.          $(\alpha 0)^{-Q} = \alpha^Q 1$;    $(\alpha 1)^{-Q} = \alpha^{-P} 0$.

Here $x$ stands for either 0 or 1, and the strings are extended on the left with zeros
if necessary. The processes will clearly always terminate. Hence every number of the
form $a + bi$ with $a$ and $b$ integers is representable in the $i - 1$ system.
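
The four mutually recursive rules can be sketched in C as a single routine acting on a digit array. The array size NDIGITS, the enum names, and the helper decode are assumptions made for this illustration; digits are stored least significant first, and "the string $\alpha$" corresponds to the digits at and above a given position.

```c
#include <assert.h>

#define NDIGITS 64                 /* assumed capacity; ample for small values */
enum rule { ADD_ONE, SUB_ONE, ADD_I, SUB_I };   /* P, -P, Q, -Q */

/* Apply a rule to the part of d[] that starts at digit position pos.
   d[] holds binary digits, least significant first, radix i-1. */
void apply(int d[NDIGITS], int pos, enum rule which) {
    assert(pos + 2 < NDIGITS);
    switch (which) {
    case ADD_ONE:                  /* (a0)^P = a1; (ax1)^P = a^Q x0 */
        if (d[pos] == 0) d[pos] = 1;
        else { d[pos] = 0; apply(d, pos + 2, ADD_I); }
        break;
    case ADD_I:                    /* (a0)^Q = a^P 1; (a1)^Q = a^{-Q} 0 */
        if (d[pos] == 0) { d[pos] = 1; apply(d, pos + 1, ADD_ONE); }
        else             { d[pos] = 0; apply(d, pos + 1, SUB_I); }
        break;
    case SUB_ONE:                  /* (ax0)^{-P} = a^{-Q} x1; (a1)^{-P} = a0 */
        if (d[pos] == 1) d[pos] = 0;
        else { d[pos] = 1; apply(d, pos + 2, SUB_I); }
        break;
    case SUB_I:                    /* (a0)^{-Q} = a^Q 1; (a1)^{-Q} = a^{-P} 0 */
        if (d[pos] == 0) { d[pos] = 1; apply(d, pos + 1, ADD_I); }
        else             { d[pos] = 0; apply(d, pos + 1, SUB_ONE); }
        break;
    }
}

/* Decode d[] to re + im*i by Horner's rule: (x + yi)(i - 1) = (-x - y) + (x - y)i. */
void decode(const int d[NDIGITS], long long *re, long long *im) {
    long long x = 0, y = 0;
    for (int j = NDIGITS - 1; j >= 0; j--) {
        long long nx = -x - y + d[j], ny = x - y;
        x = nx; y = ny;
    }
    *re = x; *im = y;
}
```

Starting from all zeros and calling apply(d, 0, SUB_ONE) once yields the digits 11101, the representation of $-1$ mentioned at the start of the answer.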


17. No (in spite of exercise 28); the number $-1$ cannot be so represented. This can
be proved by constructing a set $S$ as in Fig. 1. We do have the representations $-i =
(0.1111\ldots)_{1+i}$, $i = (100.1111\ldots)_{1+i}$.


18. Let $S_0$ be the set of points $(a_7a_6a_5a_4a_3a_2a_1a_0)_{i-1}$, where each $a_k$ is 0 or 1. (Thus,
$S_0$ is given by the 256 interior dots shown in Fig. 1, if that picture is multiplied by 16.)
We first show that $S$ is closed: If $\{y_1, y_2, \ldots\}$ is an infinite subset of $S$, we have
$y_n = \sum_{k\ge1} a_{nk} 16^{-k}$, where each $a_{nk}$ is in $S_0$. Construct a tree whose nodes are
$(a_{n1}, \ldots, a_{nr})$, for $1 \le r \le n$, and let a node of this tree be an ancestor of another node
if it is an initial subsequence of that node. By the infinity lemma (Theorem 2.3.4.3K)
this tree has an infinite path $(a_1, a_2, a_3, \ldots)$; consequently $\sum_{k\ge1} a_k 16^{-k}$ is a limit point
of $\{y_1, y_2, \ldots\}$ in $S$.

By the answer to exercise 16, all numbers of the form $(a + bi)/16^k$ are representable,
when $a$ and $b$ are integers. Therefore if $x$ and $y$ are arbitrary reals and $k \ge 1$, the
number $z_k = (\lfloor 16^k x\rfloor + \lfloor 16^k y\rfloor i)/16^k$ is in $S + m + ni$ for some integers $m$ and $n$. It
can be shown that $S + m + ni$ is bounded away from the origin when $(m, n) \ne (0, 0)$.
Consequently if $|x|$ and $|y|$ are sufficiently small and $k$ is sufficiently large, we have
$z_k \in S$, and $\lim_{k\to\infty} z_k = x + yi$ is in $S$.

[B. Mandelbrot named $S$ the “twindragon” because he noticed that it is essentially
obtained by joining two “dragon curves” belly-to-belly; see his book Fractals: Form,
Chance, and Dimension (San Francisco: Freeman, 1977), 313-314, where he also stated
that the dimension of the boundary is $2\lg x \approx 1.523627$, where $x = 1 + 2x^{-2} \approx 1.69562$.
Other properties of the dragon curve are described in C. Davis and D. E. Knuth, J. Recr.
Math. 3 (1970), 66-81, 133-149. The sets $S$ for digits {0, 1} and other complex bases
are illustrated and analyzed by D. Goffinet in AMM 98 (1991), 249-255.]

I. Kátai and J. Szabó have shown that the radix $-d + i$ yields a number system with
digits $\{0, 1, \ldots, d^2\}$; see Acta Scient. Math. 37 (1975), 255-260. Further properties of
such systems have been investigated by W. J. Gilbert, Canadian J. Math. 34 (1982),
1335-1348; Math. Magazine 57 (1984), 77-81. Another interesting case, with digits
$\{0, 1, i, -1, -i\}$ and radix $2 + i$, has been suggested by V. Norton [Math. Magazine
57 (1984), 250-251]. For studies of number systems based on more general algebraic
integers, see I. Kátai and B. Kovács, Acta Math. Acad. Sci. Hung. 37 (1981), 159-164,
405-407; B. Kovács, Acta Math. Hung. 58 (1991), 113-120; B. Kovács and A. Pethő,
Studia Scient. Math. Hung. 27 (1992), 169-172.


19. If $m > u$ or $m < l$, find $a \in D$ such that $m \equiv a$ (modulo $b$); the desired
representation will be a representation of $m' = (m - a)/b$ followed by $a$. Note that
$m > u$ implies $l < m' < m$; $m < l$ implies $m < m' < u$; so the algorithm terminates.

[There are no solutions when $b = 2$. The representation will be unique if and only
if $0 \in D$; nonunique representation occurs for example when $D = \{-3, -1, 7\}$, $b = 3$,
since $(\alpha)_3 = (\bar3 7 7 \bar3 \alpha)_3$. When $b \ge 3$ it is not difficult to show that there are exactly
$2^{b-3}$ solution sets $D$ in which $|a| < b$ for all $a \in D$. Furthermore the set $D = \{0, 1,
2 - \epsilon_2 b^n, 3 - \epsilon_3 b^n, \ldots, b - 2 - \epsilon_{b-2} b^n, b - 1 - b^n\}$ gives unique representations, for all
$b \ge 3$ and $n \ge 1$, when each $\epsilon_j$ is 0 or 1. References: Proc. IEEE Symp. Comp. Arith.
4 (1978), 1-9; JACM 29 (1982), 1131-1143.]
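
The reduction step $m \to m' = (m - a)/b$ is easy to try out. The C sketch below simply repeats that step until $m$ reaches zero, which is enough for digit sets such as balanced ternary $D = \{-1, 0, 1\}$; it is not the full argument of the answer, and it gives up after maxlen digits because an arbitrary $D$ need not represent every integer. The names represent and evaluate are hypothetical.

```c
#include <assert.h>

/* Represent m in radix b with digits from D[0..b-1], which is assumed to contain
   exactly one member of each residue class modulo b.  Digits are stored least
   significant first in out[]; the return value is the number of digits, or -1 if
   no representation was found within maxlen digits. */
int represent(long m, long b, const long *D, long *out, int maxlen) {
    int len = 0;
    assert(b >= 2);
    while (m != 0) {
        long a = 0;
        int found = 0;
        for (long j = 0; j < b; j++)
            if ((m - D[j]) % b == 0) { a = D[j]; found = 1; break; }
        assert(found);                 /* D covers every residue class */
        if (len == maxlen) return -1;  /* give up: possibly no representation */
        out[len++] = a;
        m = (m - a) / b;               /* m' = (m - a)/b */
    }
    return len;
}

/* Horner evaluation of digits[0..len-1] (least significant first) in radix b. */
long evaluate(const long *digits, int len, long b) {
    long v = 0;
    for (int j = len - 1; j >= 0; j--) v = v * b + digits[j];
    return v;
}
```

With $D = \{-1, 0, 1\}$ and $b = 3$, the integer 5 comes out as the digits $1, \bar1, \bar1$ (most significant first), since $9 - 3 - 1 = 5$; with the ordinary digits $\{0, 1, 2\}$ the routine returns $-1$ for negative $m$, as expected.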


20. (a) $0.\bar1\bar1\bar1\ldots = \bar1.888\ldots = \bar1 8.[17][17][17]\ldots = \bar1 8[17].[26][26][26]\ldots = \cdots =
\bar1 8[17][26][35][44][53][62].[71][71][71]\ldots$ has nine representations (here $[d]$ denotes the
single digit of value $d$). (b) A “D-fraction” $.a_1a_2\ldots$ always lies between $-1/9$ and
$+71/9$. Suppose $x$ has ten or more D-decimal representations. Then for sufficiently
large $k$, $10^k x$ has ten representations that differ to the left of the decimal point: $10^k x =
n_1 + f_1 = \cdots = n_{10} + f_{10}$ where each $f_j$ is a D-fraction. By uniqueness of integer
representations, the $n_j$ are distinct, say $n_1 < \cdots < n_{10}$, hence $n_{10} - n_1 \ge 9$; but this
implies $f_1 - f_{10} \ge 9 > 71/9 - (-1/9)$, a contradiction. (c) Any number of the form
$0.a_1a_2\ldots$, where each $a_j$ is $-1$ or 8, equals $\bar1.a'_1a'_2\ldots$ where $a'_j = a_j + 9$ (and it even
has six more representations $\bar1 8.a''_1a''_2\ldots$, etc.).


21. We can convert to such a representation by using a method like that suggested in
the text for converting to balanced ternary.

In contrast to the system of exercise 20, zero can be represented in infinitely
many ways, all obtained from $\frac12 + \sum_{k\ge1}(-4\frac12)\cdot10^{-k}$ (or from the negative of this
representation) by multiplying it by a power of ten. The representations of unity are
$1\frac12 - \frac12^*$, $\frac12 + \frac12^*$, $5 - 3\frac12 - \frac12^*$, $5 - 4\frac12 + \frac12^*$, $50 - 45 - 3\frac12 - \frac12^*$, $50 - 45 - 4\frac12 + \frac12^*$, etc.,
where $\pm\frac12^* = (\pm4\frac12)(10^{-1} + 10^{-2} + \cdots)$. [AMM 57 (1950), 90-93.]


22. Given some approximation $b_n \ldots b_1 b_0$ with error $\sum_{k=0}^{n} b_k 10^k - x > 10^{-t}$ for $t > 0$,
we will show how to reduce the error by approximately $10^{-t}$. (The process can be
started by finding a suitable $\sum_{k=0}^{n} b_k 10^k > x$; then a finite number of reductions of
this type will make the error less than $\epsilon$.) Simply choose $m > n$ so large that the
decimal representation of $-10^m\alpha$ has a one in position $10^{-t}$ and no ones in positions
$10^{-t+1}$, $10^{-t+2}$, ..., $10^n$. Then $10^m\alpha$ + (a suitable sum of powers of 10 between $10^m$
and $10^n$) + $\sum_{k=0}^{n} b_k 10^k \approx \sum_{k=0}^{n} b_k 10^k - 10^{-t}$.


23. The set $S = \{\sum_{k\ge1} a_k b^{-k} \mid a_k \in D\}$ is closed as in exercise 18, hence it is
measurable, and in fact it has positive measure. Since $bS = \bigcup_{a\in D}(a + S)$, we have
$b\mu(S) = \mu(bS) \le \sum_{a\in D} \mu(a + S) = \sum_{a\in D} \mu(S) = b\mu(S)$, and we must therefore have
$\mu((a + S) \cap (a' + S)) = 0$ when $a \ne a' \in D$. Now $T$ has measure zero if $0 \in D$, since
$T$ is a union of countably many sets of the form $b^k(n + ((a + S) \cap (a' + S)))$, $a \ne a'$,
each of measure zero. On the other hand, as pointed out by K. A. Brakke, every real
number has infinitely many representations in the number system of exercise 21.

[The set T cannot be empty, since the real numbers cannot be written as a 
countable union of disjoint, closed, bounded sets; see AMM 84 (1977), 827-828, and the 
more detailed analysis by Petkovšek in AMM 97 (1990), 408-411. If D has fewer than b 
elements, the set of numbers representable with radix b and digits from D has measure 
zero. If D has more than b elements and represents all reals, T has infinite measure.] 

24. $\{2a\cdot10^k + a' \mid 0 \le a < 5,\ 0 \le a' < 2\}$ or $\{5a'\cdot10^k + a \mid 0 \le a < 5,\ 0 \le a' < 2\}$,
for $k \ge 0$. [R. L. Graham has shown that there are no more sets of integer digits
with these properties. And Andrew Odlyzko has shown that the restriction to integers
is superfluous, in the sense that if the smallest two elements of $D$ are 0 and 1, all
the digits must be integers. Proof. Let $S = \{\sum_{k<0} a_k b^k \mid a_k \in D\}$ be the set of
“fractions,” and let $X = \{(a_n \ldots a_0)_b \mid a_k \in D\}$ be the set of “whole numbers”; then
$[0\,..\,\infty) = \bigcup_{x\in X}(x + S)$, and $(x + S) \cap (x' + S)$ has measure zero for $x \ne x' \in X$. We
have $(0\,..\,1) \subseteq S$, and by induction on $m$ we will prove that $(m\,..\,m + 1) \subseteq x_m + S$ for
some $x_m \in X$. Let $x_m \in X$ be such that $(m\,..\,m + \epsilon) \cap (x_m + S)$ has positive measure
for all $\epsilon > 0$. Then $x_m \le m$, and $x_m$ must be an integer lest $x_{\lfloor x_m\rfloor} + S$ overlap $x_m + S$
too much. If $x_m > 0$, the fact that $(m - x_m\,..\,m - x_m + 1) \cap S$ has positive measure
implies by induction that this measure is 1, and $(m\,..\,m + 1) \subseteq x_m + S$ since $S$ is closed.
If $x_m = 0$ and $(m\,..\,m + 1) \not\subseteq S$, we must have $m < x'_m < m + 1$ for some $x'_m \in X$,
where $(m\,..\,x'_m) \subseteq S$; but then $1 + S$ overlaps $x'_m + S$. See Proc. London Math. Soc.
(3) 18 (1978), 581-595.]

Note: If we drop the restriction 0 € D, there are many other cases, some of which 
are quite interesting, especially {1,2,3,4,5,6,7,8,9,10}, {1,2,3,4,5,51, 52, 53, 54, 55}, 
and {2,3,4,5,6,52,53,54,55,56}. Alternatively if we allow negative digits we obtain 
many other solutions by the method of exercise 19, plus further sets of unusual digits 
like {—1,0,1,2,3,4,5,6,7,18} that don’t meet the conditions stated there. It appears 
hopeless to find a nice characterization of all solutions with negative digits. 


25. A positive number whose radix-$b$ representation has $m$ consecutive $(b-1)$’s to the
right of the radix point must have the form $c/b^n + (b^m - \theta)/b^{n+m}$, where $c$ and $n$ are
nonnegative integers and $0 < \theta \le 1$. So if $u/v$ has this form, we find that $b^{m+n}u =
b^m cv + b^m v - \theta v$. Therefore $\theta v$ is an integer that is a multiple of $b^m$. But $0 < \theta v \le v <
b^m$. [There can be arbitrarily long runs of other digits $a$, if $0 \le a < b - 1$, for example
in the representation of $a/(b - 1)$.]


26. The proof of “sufficiency” is a straightforward generalization of the usual proof for 
base b, by successively constructing the desired representation. The proof of “necessity” 
breaks into two parts: If $\beta_{n+1}$ is greater than $\sum_{k\le n} c_k\beta_k$ for some $n$, then $\beta_{n+1} - \epsilon$
has no representation for small $\epsilon$. If $\beta_{n+1} \le \sum_{k\le n} c_k\beta_k$ for all $n$, but equality does
not always hold, we can show that there are two representations for certain $x$. [See
Transactions of the Royal Society of Canada, series III, 46 (1952), 45-55.] 


27. Proof by induction on $|n|$: If $n$ is even we must take $e_0 > 0$, and the result follows by
induction, since $n/2$ has a unique such representation. If $n$ is odd, we must take $e_0 = 0$,
and the problem reduces to representing $-(n-1)/2$; if the latter quantity is either zero
or one, there is obviously only one way to proceed, otherwise it has a unique reversing
representation by induction. [A. D. Booth, in Quarterly J. Mechanics and Applied
Math. 4 (1951), 236-240, applied this principle to two’s complement multiplication.]

[It follows that every positive integer has exactly two such representations with
decreasing exponents $e_0 > e_1 > \cdots > e_t$: one with $t$ even and the other with $t$ odd.]
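
The induction in the first paragraph is constructive, and can be sketched in C as a loop. The sketch assumes the reversing representation has the form $n = 2^{e_0} - 2^{e_1} + 2^{e_2} - \cdots$ with $e_0 < e_1 < \cdots < e_t$, as the two cases of the proof suggest; the names reversing_binary and reversing_value are illustrative only.

```c
#include <assert.h>

/* Find exponents e[0] < e[1] < ... < e[t] with
   n = 2^e[0] - 2^e[1] + 2^e[2] - ... ; returns t+1 (the number of terms). */
int reversing_binary(long n, int *e, int maxterms) {
    int terms = 0, shift = 0;
    assert(n != 0);
    while (n != 0) {
        while (n % 2 == 0) { n /= 2; shift++; }   /* n even: pass to n/2 */
        assert(terms < maxterms);
        e[terms++] = shift;                       /* n odd: this exponent is forced */
        n = -(n - 1) / 2;                         /* remaining value, sign reversed */
        shift++;
    }
    return terms;
}

/* Evaluate 2^e[0] - 2^e[1] + ... for checking. */
long reversing_value(const int *e, int terms) {
    long v = 0, sign = 1;
    for (int j = 0; j < terms; j++) { v += sign * (1L << e[j]); sign = -sign; }
    return v;
}
```

For example, $3 = 2^0 - 2^1 + 2^2$ and $-1 = 2^0 - 2^1$. The loop terminates because $|-(n-1)/2| < |n|$ unless $n = -1$, and that case leads to 1 and then to 0.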


28. A proof like that of exercise 27 may be given. Note that a + bi is a multiple of 
1 +i by a complex integer if and only if a + b is even. This representation is intimately 
related to the dragon curve discussed in the answer to exercise 18. 


29. It suffices to prove that any collection $\{T_0, T_1, T_2, \ldots\}$ satisfying Property B may
be obtained by collapsing some collection $\{S_0, S_1, S_2, \ldots\}$, where $S_0 = \{0, 1, \ldots, b-1\}$
and all elements of $S_1$, $S_2$, ... are multiples of $b$.

To prove the latter statement, we may assume that $1 \in T_0$ and that there is a
least element $b > 1$ such that $b \notin T_0$. We will prove, by induction on $n$, that if $nb \notin T_0$,
then $nb + 1$, $nb + 2$, ..., $nb + b - 1$ are not in any of the $T_j$’s; but if $nb \in T_0$, then so
are $nb + 1$, ..., $nb + b - 1$. The result then follows with $S_1 = \{nb \mid nb \in T_0\}$, $S_2 = T_1$,
$S_3 = T_2$, etc.

If $nb \notin T_0$, then $nb = t_0 + t_1 + \cdots$, where $t_1$, $t_2$, ... are multiples of $b$; hence
$t_0 < nb$ is a multiple of $b$. By induction, $(t_0 + k) + t_1 + t_2 + \cdots$ is the representation of
$nb + k$, for $0 < k < b$; hence $nb + k \notin T_j$ for any $j$.

If $nb \in T_0$ and $0 < k < b$, let the representation of $nb + k$ be $t_0 + t_1 + \cdots$. We
cannot have $t_j = nb + k$ for $j \ge 1$, lest $nb + b$ have two representations $(b - k) + \cdots +
(nb + k) + \cdots = (nb) + \cdots + b + \cdots$. By induction, $t_0 \bmod b = k$; and the representation
$nb = (t_0 - k) + t_1 + \cdots$ implies that $t_0 = nb + k$.

[Reference: Nieuw Archief voor Wiskunde (3) 4 (1956), 15-17. A finite analog of
this result was derived by P. A. MacMahon, Combinatory Analysis 1 (1915), 217-223.]


30. (a) Let $A_j$ be the set of numbers $n$ whose representation does not involve $b_j$; then
by the uniqueness property, $n \in A_j$ if and only if $n + b_j \notin A_j$. Consequently we have
$n \in A_j$ if and only if $n + 2b_j \in A_j$. It follows that, for $j \ne k$, $n \in A_j \cap A_k$ if and
only if $n + 2b_jb_k \in A_j \cap A_k$. Let $m$ be the number of integers $n \in A_j \cap A_k$ such
that $0 \le n < 2b_jb_k$. Then this interval contains exactly $m$ integers that are in $A_j$
but not $A_k$, exactly $m$ in $A_k$ but not $A_j$, and exactly $m$ in neither $A_j$ nor $A_k$; hence
$4m = 2b_jb_k$. Therefore $b_j$ and $b_k$ cannot both be odd. But at least one $b_j$ is odd, of
course, since odd numbers can be represented.

(b) According to (a) we can renumber the $b$'s so that $b_0$ is odd and $b_1, b_2, \ldots$ are
even; then $\frac12 b_1, \frac12 b_2, \ldots$ must also be a binary basis, and the process can be iterated.

(c) If it is a binary basis, we must have positive and negative $d_k$'s for arbitrarily
large $k$, in order to represent $\pm 2^n$ when $n$ is large. Conversely, the following algorithm
may be used:

S1. [Initialize.] Set $k \leftarrow 0$.
S2. [Done?] If $n = 0$, terminate.
S3. [Choose.] If $n$ is even, set $n \leftarrow n/2$. Otherwise include $2^k d_k$ in the representation,
    and set $n \leftarrow (n - d_k)/2$.
S4. [Advance k.] Increase $k$ by 1 and return to S2. ∎

At each step the choice is forced; furthermore step S3 always decreases $|n|$ unless
$n = -d_k$, hence the algorithm must terminate.

(d) Two iterations of steps S2-S4 in the preceding algorithm will change $4m \to m$,
$4m+1 \to m+5$, $4m+2 \to m+7$, $4m+3 \to m-1$. Arguing as in exercise 19, we
need only show that the algorithm terminates for $-2 \le n \le 8$; all other values of $n$ are
moved toward this interval. In this range $3 \to -1 \to -2 \to 6 \to 8 \to 2 \to 7 \to 0$ and
$4 \to 1 \to 5 \to 6$. Thus $1 = 7\cdot2^0 - 13\cdot2^1 + 7\cdot2^2 - 13\cdot2^3 - 13\cdot2^5 - 13\cdot2^9 + 7\cdot2^{10}$.
Note: The choice $d_0, d_1, d_2, \ldots = 5, -3, 3, 5, -3, 3, \ldots$ also yields a binary basis.
For further details see Math. Comp. 18 (1964), 537-546; A. D. Sands, Acta Math.
Acad. Sci. Hung. 8 (1957), 65-86.
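
Steps S1-S4 of part (c) can be sketched in Python as follows. The function names are hypothetical; the multipliers $d_k = 7, -13, 7, -13, \ldots$ are the ones that reproduce the two-step transitions and the expansion of 1 quoted in part (d). The `max_steps` guard is an added safety net, not part of the algorithm.

```python
def binary_basis_exponents(n, d, max_steps=10_000):
    """Return the list of k for which 2^k * d(k) appears in the representation of n.

    d(k) must be odd for every k. Mirrors steps S1-S4: halve n when it is even,
    otherwise take the term 2^k d(k) and continue with (n - d(k))/2.
    """
    used = []
    k = 0                                   # S1
    while n != 0:                           # S2
        if k >= max_steps:
            raise RuntimeError("no termination within max_steps")
        if n % 2 == 0:                      # S3, even case
            n //= 2
        else:                               # S3, odd case: the choice is forced
            used.append(k)
            n = (n - d(k)) // 2             # exact, since n and d(k) are both odd
        k += 1                              # S4
    return used


def d_example(k):
    """Multipliers 7, -13, 7, -13, ... used in part (d)."""
    return 7 if k % 2 == 0 else -13
```

With these definitions, `binary_basis_exponents(1, d_example)` returns the exponents 0, 1, 2, 3, 5, 9, 10 that appear in the expansion of 1 above.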


31. (See also the related exercises 3.2.2-11, 4.3.2-13, 4.6.2-22.) 

(a) By multiplying numerator and denominator by suitable powers of 2, we may
assume that $u = (\ldots u_2u_1u_0)_2$ and $v = (\ldots v_2v_1v_0)_2$ are 2-adic integers, where $v_0 = 1$.
The following computational method now determines $w$, using the notation $u^{(n)}$ to
stand for the integer $(u_{n-1} \ldots u_0)_2 = u \bmod 2^n$ when $n > 0$:

Let $w_0 = u_0$ and $w^{(1)} = w_0$. For $n = 1, 2, \ldots$, assume that we have found
an integer $w^{(n)} = (w_{n-1} \ldots w_0)_2$ such that $u^{(n)} \equiv v^{(n)}w^{(n)}$ (modulo $2^n$). Then we
have $u^{(n+1)} \equiv v^{(n+1)}w^{(n)}$ (modulo $2^n$), hence $w_n = 0$ or 1 according as the quantity
$(u^{(n+1)} - v^{(n+1)}w^{(n)}) \bmod 2^{n+1}$ is 0 or $2^n$.
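
The bit-by-bit method of part (a) can be sketched in Python as follows. The function name and the `bits` parameter are hypothetical, and ordinary Python integers stand in for the truncated 2-adic numbers $u^{(n)}$ and $v^{(n)}$.

```python
def two_adic_divide(u, v, bits):
    """Return w mod 2**bits, where w = u/v as a 2-adic integer (v must be odd)."""
    if v % 2 == 0:
        raise ValueError("v must be odd (v_0 = 1)")
    w = 0
    for n in range(bits):
        # Invariant: u is congruent to v*w modulo 2**n.
        # The next bit w_n is 1 exactly when the residue mod 2**(n+1) equals 2**n.
        if (u - v * w) % 2 ** (n + 1) != 0:
            w += 2 ** n
    return w
```

For instance, `two_adic_divide(-1, 3, 8)` gives 85, the bit pattern 01010101, which is the low end of the periodic expansion of $-1/3$ described in part (b).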

(b) Find the smallest integer $k$ such that $2^k \equiv 1$ (modulo $2n+1$). Then we have
$1/(2n+1) = m/(2^k - 1)$ for some integer $m$, $1 \le m < 2^{k-1}$. Let $\alpha$ be the $k$-bit binary
representation of $m$; then $(0.\alpha\alpha\alpha\ldots)_2$ times $2n+1$ is $(0.111\ldots)_2 = 1$ in the binary
system, and $(\ldots\alpha\alpha\alpha)_2$ times $2n+1$ is $(\ldots111)_2 = -1$ in the 2-adic system.

(c) If $u$ is rational, say $u = m/(2^e n)$ where $n$ is odd and positive, the 2-adic
representation of $u$ is periodic, because the set of numbers with periodic expansions
includes $-1/n$ and is closed under the operations of negation, division by 2, and
addition. Conversely, if $u_{N+\lambda} = u_N$ for all sufficiently large $N$, the 2-adic number
$(2^\lambda - 1)2^r u$ is an integer for all sufficiently large $r$.

(d) The square of any number of the form $(\ldots u_2u_11)_2$ has the form $(\ldots001)_2$,
hence the condition is necessary. To show the sufficiency, we can use the following
procedure to compute $v = \sqrt{n}$ when $n \bmod 8 = 1$:

H1. [Initialize.] Set $m \leftarrow (n-1)/8$, $k \leftarrow 2$, $v_0 \leftarrow 1$, $v_1 \leftarrow 0$, $v \leftarrow 1$. (During this
    algorithm we will have $v = (v_{k-1} \ldots v_1v_0)_2$ and $v^2 = n - 2^{k+1}m$.)
H2. [Transform.] If $m$ is even, set $v_k \leftarrow 0$, $m \leftarrow m/2$. Otherwise set $v_k \leftarrow 1$,
    $m \leftarrow (m - v - 2^{k-1})/2$, $v \leftarrow v + 2^k$.
H3. [Advance k.] Increase $k$ by 1 and return to H2. ∎
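
Steps H1-H3 never terminate by themselves, since they produce one more bit of the 2-adic square root per iteration. A Python sketch that stops after a prescribed number of bits (the function name and the `bits` cutoff are assumptions made for illustration) is:

```python
def two_adic_sqrt(n, bits):
    """Return v with 0 < v < 2**bits and v*v congruent to n modulo 2**(bits + 1).

    Requires n mod 8 == 1 and bits >= 2. Maintains v*v == n - 2**(k+1) * m.
    """
    if n % 8 != 1 or bits < 2:
        raise ValueError("need n mod 8 == 1 and bits >= 2")
    m = (n - 1) // 8          # H1: now v*v = 1 = n - 8m with k = 2
    k = 2
    v = 1
    while k < bits:
        if m % 2 == 0:        # H2: bit v_k is 0
            m //= 2
        else:                 # H2: bit v_k is 1
            m = (m - v - 2 ** (k - 1)) // 2
            v += 2 ** k
        k += 1                # H3
    return v
```

The returned `v` is one of the two square roots; the other is its negative. For `n = -7` the result is a truncation of the 2-adic number $\sqrt{-7}$.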
32. A more general result appears in Math. Comp. 29 (1975), 84-86. 


33. Let $K_n$ be the set of all such $n$-digit numbers, so that $k_n = |K_n|$. If $S$ and $T$ are
any finite sets of integers, we shall say $S \sim T$ if $S = T + x$ for some integer $x$, and we
shall write $k_n(S) = |\mathcal{K}_n(S)|$, where $\mathcal{K}_n(S)$ is the family of all subsets of $K_n$ that are
$\sim S$. When $n = 0$, we have $k_n(S) = 0$ unless $|S| \le 1$, since zero is the only "0-digit"
number. When $n \ge 1$ and $S = \{s_1, \ldots, s_r\}$, we have

$$\mathcal{K}_n(S) = \bigcup_{0\le j<b}\;\bigcup_{(a_1,\ldots,a_r)} \bigl\{\{t_1b+a_1, \ldots, t_rb+a_r\} \bigm| \{t_1,\ldots,t_r\} \in \mathcal{K}_{n-1}(\{(s_i+j-a_i)/b \mid 1\le i\le r\})\bigr\},$$

where the inner union is over all sequences of digits $(a_1,\ldots,a_r)$ satisfying the
condition $a_i \equiv s_i + j$ (modulo $b$) for $1 \le i \le r$. In this formula we require $t_i - t_{i'} =
(s_i - a_i)/b - (s_{i'} - a_{i'})/b$ for $1 \le i < i' \le r$, so that the naming of subscripts is
uniquely determined. By the principle of inclusion and exclusion, therefore, we have
$k_n(S) = \sum_{0\le j<b}\sum_{m\ge1}(-1)^{m-1}f(S,m,j)$, where $f(S,m,j)$ is the number of sets of
integers that can be expressed as $\{t_1b+a_1, \ldots, t_rb+a_r\}$ in the manner above for
$m$ different sequences $(a_1,\ldots,a_r)$, summed over all choices of $m$ different sequences
$(a_1,\ldots,a_r)$. Given $m$ different sequences $(a_1^{(l)}, \ldots, a_r^{(l)})$ for $1 \le l \le m$, the number of
such sets is $k_{n-1}(\{(s_i + j - a_i^{(l)})/b \mid 1\le i\le r,\ 1\le l\le m\})$. Thus there is a collection
of sets $\mathcal{T}(S)$ such that

$$k_n(S) = \sum_{T\in\mathcal{T}(S)} c_T\,k_{n-1}(T),$$

where each $c_T$ is an integer. Furthermore if $T \in \mathcal{T}(S)$, its elements are near those
of $S$; we have $\min T \ge (\min S - \max D)/b$ and $\max T \le (\max S + b - 1 - \min D)/b$.
Thus we obtain simultaneous recurrence relations for the sequences $\langle k_n(S)\rangle$, where $S$
runs through the nonempty integer subsets of $[l\,..\,u+1]$, in the notation of exercise 19.
Since $k_n = k_n(S)$ for any one-element set $S$, the sequence $\langle k_n\rangle$ appears among these
recurrences. The coefficients $c_T$ can be computed from the first few values of $k_n(S)$,
so we can obtain a system of equations defining the generating functions $k_S(z) =
\sum k_n(S)z^n = [|S| \le 1] + z\sum_{T\in\mathcal{T}(S)} c_T k_T(z)$. [See J. Algorithms 2 (1981), 31-43.]

For example, when $D = \{-1, 0, 3\}$ and $b = 3$ we have $l = -\frac32$ and $u = \frac12$, so the
relevant sets $S$ are $\{0\}$, $\{0,1\}$, $\{-1,1\}$, and $\{-1,0,1\}$. The corresponding sequences
for $n \le 3$ are $\langle1,3,8,21\rangle$, $\langle0,1,3,8\rangle$, $\langle0,0,1,4\rangle$, and $\langle0,0,0,0\rangle$; so we obtain

$$k_0(z) = 1 + z(3k_0(z) - k_{01}(z)), \qquad k_{02}(z) = z(k_{01}(z) + k_{02}(z)),$$
$$k_{01}(z) = zk_0(z), \qquad k_{012}(z) = 0,$$

and $k(z) = 1/(1 - 3z + z^2)$. In this case $k_n = F_{2n+2}$ and $k_n(\{0,2\}) = F_{2n-1} - 1$.
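
The closing example can be checked by brute force. The sketch below is not the recurrence method of the answer; it simply enumerates all $n$-digit numbers with digits in $D = \{-1, 0, 3\}$ and radix 3, then counts translates of a given shape. All function names are hypothetical.

```python
def representable(n, digits=(-1, 0, 3), base=3):
    """Set K_n of integers expressible with exactly n digits from `digits`."""
    values = {0}                       # zero is the only 0-digit number
    for _ in range(n):
        values = {base * x + d for x in values for d in digits}
    return values


def count_translates(n, shape):
    """k_n(S): number of subsets of K_n that are translates of `shape`.

    `shape` is a tuple of offsets whose minimum is 0, such as (0, 2).
    """
    K = representable(n)
    return sum(1 for x in K if all(x + s in K for s in shape))


def fib(k):
    """Fibonacci number F_k for k >= 0 (F_0 = 0, F_1 = 1)."""
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a
```

Here `count_translates(n, (0,))` is $k_n$, and the shape `(0, 2)` corresponds to the set $\{-1, 1\}$ in the list of relevant sets, since the two are translates of each other.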


34. There is exactly one string $\alpha_n$ on the symbols $\{\bar1, 0, 1\}$ such that $n = (\alpha_n)_2$ and
$\alpha_n$ has no leading zeros or consecutive nonzeros: $\alpha_0$ is empty, otherwise $\alpha_{2n} = \alpha_n0$,
$\alpha_{4n+1} = \alpha_n01$, $\alpha_{4n-1} = \alpha_n0\bar1$. Any string that represents $n$ can be converted to
this "canonical signed bit representation" by using the reductions $1\bar1 \to 01$, $\bar11 \to 0\bar1$,
$01\ldots11 \to 10\ldots0\bar1$, $0\bar1\ldots\bar1\bar1 \to \bar10\ldots01$, and inserting or deleting leading zeros.
Since these reductions do not increase the number of nonzero digits, $\alpha_n$ has the fewest.
[Advances in Computers 1 (1960), 244-260.] The number of nonzero digits in $\alpha_n$,
denoted by P(n), is the number of 1s in the ordinary representation that are immediately
preceded by 0 or by the substring $00(10)^k1$ for some $k \ge 0$. (See exercise 7.1.3-35.)

A generalization to radix $b > 2$ has been given by J. von zur Gathen, Computational
Complexity 1 (1991), 360-394.
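
The recurrences for $\alpha_{2n}$, $\alpha_{4n+1}$, and $\alpha_{4n-1}$ translate directly into a digit-by-digit procedure. A minimal Python sketch (the function name is hypothetical; digits are returned most significant first, with $-1$ standing for $\bar1$):

```python
def canonical_signed_bits(n):
    """Digits of the canonical signed bit representation of the integer n.

    Each digit is -1, 0, or 1; no two consecutive digits are nonzero.
    """
    digits = []                     # built least significant digit first
    while n != 0:
        if n % 2 == 0:              # alpha_{2m} = alpha_m 0
            digits.append(0)
            n //= 2
        elif n % 4 == 1:            # alpha_{4m+1} = alpha_m 0 1
            digits.append(1)
            n = (n - 1) // 2        # result is even, so the next digit is 0
        else:                       # alpha_{4m-1} = alpha_m 0 (-1)
            digits.append(-1)
            n = (n + 1) // 2        # result is even, so the next digit is 0
    return digits[::-1]
```

For example, `canonical_signed_bits(7)` gives `[1, 0, 0, -1]`, that is, 7 = 8 - 1, using two nonzero digits instead of the three in ordinary binary.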


SECTION 4.2.1 


1. N = (62,+.60 22 14 00); h = (37,+.66 26 10 00). Note that the quantity 10h 
would be (38, +.06 62 61 00). 


2. $b^{E-q}(1 - b^{-p})$, $b^{-q-p}$; $b^{E-q}(1 - b^{-p})$, $b^{-q-1}$.

3. When e does not have its smallest value, the most significant “one” bit (which 
appears in all such normalized numbers) need not appear in the computer word. 


4. (51, +.10209877); (50, +.12346000); (53, +.99999999). The third answer would be 
(54, +.10000000) if the first operand had been (45, —.50000000), since b/2 is odd. 


5. If x ~ y and m is an integer then mb + x ~ mb + y. Furthermore x ~ y implies 
x/b ~ y/b, by considering all possible cases. Another crucial property is that x and y 
will round to the same integer, whenever bx ~ by. 

Now if $b^{-p-2}F_v \ne f_v$ we must have $(b^{p+2}f_v) \bmod b \ne 0$; hence the transformation
leaves $f_v$ unchanged unless $e_u - e_v \ge 2$. Since $u$ was normalized, it is nonzero and
$|f_u + f_v| > b^{-1} - b^{-2} \ge b^{-2}$: The leading nonzero digit of $f_u + f_v$ must be at most
two places to the right of the radix point, and the rounding operation will convert
$b^{p+j}(f_u + f_v)$ to an integer, where $j \le 1$. The proof will be complete if we can show
that $b^{p+j+1}(f_u + f_v) \sim b^{p+j+1}(f_u + b^{-p-2}F_v)$. By the previous paragraph, we have
$b^{p+2}(f_u + f_v) \sim b^{p+2}f_u + F_v = b^{p+2}(f_u + b^{-p-2}F_v)$, which implies the desired result
for all $j \le 1$. Similar remarks apply to step M2 of Algorithm M.

Note that, when $b > 2$ is even, such an integer $F_v$ always exists; but when $b = 2$
we require $p + 3$ bits (let $2F_v$ be an integer). When $b$ is odd, an integer $F_v$ always exists
except in the case of division by Algorithm M, when a remainder of $\frac12 b$ is possible.


6. (Consider the case eu = ev, fu = —fv in Program A.) Register A retains its 
previous sign, as in ADD. 


7. Say that a number is normalized if and only if it is zero or its fraction part lies in the
range $\frac16 < |f| < \frac12$. A $(p+1)$-place accumulator suffices for addition and subtraction;
rounding (except during division) is equivalent to truncation. A very pleasant system
indeed! We might represent numbers with excess-zero exponent, inserted between the
first and subsequent digits of the fraction, and complemented if the fraction is negative,
so that the order of fixed point numbers is preserved.


8. (a) (06, +.12345679) ⊕ (06, −.12345678), (01, +.10345678) ⊕ (00, −.94000000);
(b) (99, +.87654321) ⊕ itself, (99, +.99999999) ⊕ (91, +.50000000).


9. a = c = (—50, +.10000000), b = (—41, +.20000000), d = (—41, +.80000000), y = 
(11, +.10000000). 


10. (50, +.99999000) ⊕ (55, +.99999000).

11. (50, +.10000001) ⊗ (50, +.99999990).


12. If $0 < |f_u| < |f_v|$, then $|f_u| \le |f_v| - b^{-p}$; hence $1/b < |f_u/f_v| \le 1 - b^{-p}/|f_v| <
1 - b^{-p}$. If $0 < |f_v| \le |f_u|$, we have $1/b \le |f_u/f_v|/b \le ((1 - b^{-p})/(1/b))/b = 1 - b^{-p}$.


13. See J. Michael Yohe, IEEE Trans. C-22 (1973), 577-586; see also exercise 4.2.2—24. 


14. FIX   STJ  9F            Float-to-fix subroutine:
          STA  TEMP
          LD1  TEMP(EXP)     rI1 ← e.
          SLA  1             rA ← ± f f f f 0.
          JAZ  9F            Is input zero?
          DEC1 1
          CMPA =0=(1:1)      If leading byte is zero,
          JE   *-4             shift left again.
          ENN1 -Q-4,1
          J1N  FIXOVFLO      Is magnitude too large?
          ENTX 0
          SRAX 0,1
          CMPX =1//2=
          JL   9F
          JG   *+2
          JAO  9F            The ambiguous case becomes odd, since b/2 is even.
          STA  *+1(0:0)      Round, if necessary.
          INCA 1             Add ±1 (overflow is impossible).
    9H    JMP  *             Exit from subroutine. ∎

15. FP    STJ  EXITF         Fractional part subroutine:
          JOV  OFLO          Ensure that overflow is off.
          STA  TEMP          TEMP ← u.
          ENTX 0
          SLA  1             rA ← f_u.
          LD2  TEMP(EXP)     rI2 ← e_u.
          DEC2 Q
          J2NP *+3
          SLA  0,2           Remove integer part of u.
          ENT2 0
          JANN 1F
          ENN2 0,2           Fraction is negative: Find
          SRAX 0,2             its complement.
          ENT2 0
          JXNZ *+3
          JAZ  *+2
          INCA 1
          ADD  WM1           Add word size minus one.
    1H    INC2 Q             Prepare to normalize the answer.
          JMP  NORM          Normalize, round, and exit.
    8H    EQU  1(1:1)
    WM1   CON  8B-1,8B-1(1:4)  Word size minus one ∎

16. If $|c| \ge |d|$, then set $r \leftarrow d \oslash c$, $s \leftarrow c \oplus (r \otimes d)$; $x \leftarrow (a \oplus (b \otimes r)) \oslash s$,
$y \leftarrow (b \ominus (a \otimes r)) \oslash s$. Otherwise set $r \leftarrow c \oslash d$, $s \leftarrow d \oplus (r \otimes c)$; $x \leftarrow ((a \otimes r) \oplus b) \oslash s$,
$y \leftarrow ((b \otimes r) \ominus a) \oslash s$. Then $x + iy$ is the desired approximation to $(a + bi)/(c + di)$.
Computing $s' \leftarrow 1 \oslash s$ and multiplying twice by $s'$ may be better than dividing twice
by $s$. As with (11), gradual underflow is recommended for the calculation of $r$ unless
special precautions are taken. [CACM 5 (1962), 435. Other algorithms for complex
arithmetic and function evaluation are given by P. Wynn, BIT 2 (1962), 232-255. For
$|a + bi|$, see Paul Friedland, CACM 10 (1967), 665.]
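
The scaled division of answer 16 can be sketched in Python with ordinary double-precision arithmetic standing in for ⊕, ⊖, ⊗, ⊘. The function name is hypothetical, and no special handling of underflow in `r` or of a zero divisor is attempted.

```python
def complex_divide(a, b, c, d):
    """Return (x, y) with x + iy approximating (a + bi)/(c + di).

    Dividing by the larger of |c|, |d| first avoids forming c*c + d*d,
    which could overflow even when the quotient is representable.
    """
    if abs(c) >= abs(d):
        r = d / c
        s = c + r * d
        return (a + b * r) / s, (b - a * r) / s
    r = c / d
    s = d + r * c
    return (a * r + b) / s, (b * r - a) / s
```

For operands as large as `1e300`, the textbook formula with denominator `c*c + d*d` would overflow, whereas this version returns the quotient without trouble.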
17. See Robert Morris, IEEE Trans. C-20 (1971), 1578-1579. Error analysis is more 
difficult with such systems, so interval arithmetic is correspondingly more desirable. 
18. For positive numbers: Shift fraction left until $f_1 = 1$, then round, then if the
fraction is zero (rounding overflow) shift it right again. For negative numbers: Shift
fraction left until $f_1 = 0$, then round, then if the fraction is zero (rounding underflow)
shift it right again.
19. (73—(5—[rounding digits are £0...0])(6—[magnitude is rounded up])+[e.<eu]+ 
[first rounding digit is ]—[fraction overflow]—10[result zero]+7[rounding overflow]+ 
7N + (3 + (16 + [result negative])[opposite signs]) X)u, where N is the number of left 
shifts during normalization, and X is the condition that rX receives nonzero digits and 
there is no fraction overflow. The maximum time of 84u occurs for example when 


u = —50 01 00 00 00, v=+45 49999999, b=100. 


[The average time, considering the data in Section 4.2.4, will be less than 47u.] 


SECTION 4.2.2 

1. $u \ominus v = u \oplus -v = -v \oplus u = -(v \oplus -u) = -(v \ominus u)$.

2. $u \oplus x \ge u \oplus 0 = u$, by (8), (2), (6); hence by (8) again, $(u \oplus x) \oplus v \ge u \oplus v$.
Similarly, (8) and (6) together with (2) imply that $(u \oplus x) \oplus (v \oplus y) \ge (u \oplus x) \oplus v$.

3. $u = 8.0000001$, $v = 1.2500008$, $w = 8.0000008$; $(u \otimes v) \otimes w = 80.000064$, yet
$u \otimes (v \otimes w) = 80.000057$.

4. Yes; let $1/u \approx v = w$, where $v$ is large.

5. Not always; in decimal arithmetic take $u = v = 9$.

6. (a) Yes. (b) Only for $b + p \le 4$ (try $u = 1 - b^{-p}$). But see exercise 27.

7. If $u$ and $v$ are consecutive floating binary numbers, $u \oplus v = 2u$ or $2v$. When it is
$2v$ we often have $u^{②} \oplus v^{②} < 2v^{②}$. For example, $u = (.10\ldots001)_2$, $v = (.10\ldots010)_2$,
$u \oplus v = 2v$, and $u^{②} + v^{②} = (.10\ldots011)_2$.

8. (a) ~, ~; (b) ~, œ; (c) ~, =; :(@) ~; (e) ~. 

9. $|u - w| \le |u - v| + |v - w| \le \epsilon_1\min(b^{e_u-q}, b^{e_v-q}) + \epsilon_2\min(b^{e_v-q}, b^{e_w-q}) \le \epsilon_1b^{e_u-q} +
\epsilon_2b^{e_w-q} \le (\epsilon_1 + \epsilon_2)\max(b^{e_u-q}, b^{e_w-q})$. The result cannot be strengthened in general,
since for example we might have $e_v$ very small compared to both $e_u$ and $e_w$, and this
means that $u - w$ might be fairly large under the hypotheses.

10. We have (.a1 .. . dp—14p)o@(.9... 99)» = (.a1 . . . @p—1(@p—1)) if ap > Landa: > 8; 
here “9” stands for b— 1. Furthermore, (.a1 . . . ap—14p)o Q (1.0. 0)» = (.a1 ...ap—10)b, 
so the multiplication is not monotone if b > 2 and ap > 1+ lax 2]. But when b=2, 
this argument can be extended to show that multiplication is monotone; obviously the 
“certain computer” had b > 2. 

11. Without loss of generality, let x be an integer, 0 < x < b”. If e < 0, then t = 0. If 
0 < e < p, then x − t has at most p + 1 digits, the least significant being zero. If e > p,
then x — t = 0. [The result holds also under the weaker hypothesis |t| < bf; in that 
case we might have x — t = b° when e > p.] 


12. Assume that eu = p, e» < 0, u > 0. Case 1, u > bP’. Case (la), w = u +1, 


re 


v > 4, es =0. Then v =u or u+1, v = 1, u” = u, v” tor 0. Case (1b), 
w = u, |v| < 4. Then a’ = u, v' = 0, u” = u, v” 0. If |u| = 4 and more general 
rounding is permitted we might also have u’ = u+1, v” = F1. Case (1c), w = u-— 1, 
vS- ev = 0. Then w’ =uoru- 1, v = —1, u” Zaa 1 or 0. Case 2, 


u = b?™}, Case (2a), w = u+ 1, v > $, ex = 0. Like (la). Case (2b), w = u, |v| < § 
wu >u. Like (1b). Case (2c), w = u, [v| < 4, u’ < u. Then u’ = u — j/b where 
v = j/b + vı and |vi| < b7} for some positive integer j < $b; we have v' = 0, u” = u, 

" = j/b. Case (2d), w < u. Then w = u — j/b where v = —j/b + vı and |vi| < $67! 
for some positive integer j < b; we have (v',u”) = (—j/b,u), and (u', v”) = (u, —j/b) 
or (u— 1/b, (1 — j)/b), the latter case only when vı = 407+. In all cases uO u’ = u-w', 


vOv =v- v, uu" =u- u", vOv" =v -— v", round(w-— u-—v)=w-— u-v. 


13. Since round$(x) = 0$ if and only if $x = 0$, we want to find a large set of integer
pairs $(m, n)$ with the property that $m \oslash n$ is an integer if and only if $m/n$ is. Assume
that $|m|, |n| < b^p$. If $m/n$ is an integer, then $m \oslash n = m/n$ is also. Conversely if
$m/n$ is not an integer, but $m \oslash n$ is, we have $1/|n| \le |m \oslash n - m/n| < \frac12|m/n|b^{1-p}$,
hence $|m| > 2b^{p-1}$. Our answer is therefore to require $|m| \le 2b^{p-1}$ and $0 < |n| < b^p$.
(Slightly weaker hypotheses are also possible.)

14. |(u 8v) 8w -— uvuy| < |(u 8 v) @w— (u 8 v)w| + |w| |u 8 v — uv| < fiugvjow + 
peeatu Sugu < (1 + b)O(uavj@w: Now |e(u@vjaw — Cu@(vaw)| < 2, so we may take 

= 5(14+6)(14+d°)b®. 

15. $u \le v$ implies that $(u \oplus u) \oslash 2 \le (u \oplus v) \oslash 2 \le (v \oplus v) \oslash 2$, so the condition holds
for all $u$ and $v$ if and only if it holds whenever $u = v$. For base $b = 2$, the condition
is therefore always satisfied (barring overflow); but for $b > 2$ there are numbers $v \ne w$
such that $v \oplus v = w \oplus w$, hence the condition fails. [On the other hand, the formula
$u \oplus ((v \ominus u) \oslash 2)$ does give a midpoint in the correct range. Proof. It suffices to
show that $u + (v \ominus u) \oslash 2 \le v$, i.e., $(v \ominus u) \oslash 2 \le v - u$; and it is easy to verify that
round$(\frac12\,$round$(x)) \le x$ for all $x \ge 0$.]
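
The failure for $b > 2$ can be observed with a small decimal experiment. The sketch below uses Python's `decimal` module with two-digit precision as a stand-in for floating decimal arithmetic; the function names and the sample operands are assumptions chosen for illustration.

```python
from decimal import Context, Decimal

ctx = Context(prec=2)   # two significant decimal digits: b = 10, p = 2


def naive_midpoint(u, v):
    """(u (+) v) (/) 2 with every operation rounded to two digits."""
    return ctx.divide(ctx.add(u, v), Decimal(2))


def safe_midpoint(u, v):
    """u (+) ((v (-) u) (/) 2), the formula recommended in the answer."""
    return ctx.add(u, ctx.divide(ctx.subtract(v, u), Decimal(2)))


u, v = Decimal("0.56"), Decimal("0.58")
```

With these operands the sum 1.14 rounds to 1.1, so `naive_midpoint(u, v)` is 0.55, which lies below both operands, while `safe_midpoint(u, v)` is 0.57.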

16. (a) Exponent changes occur at $ 4o = 11.111111, 304, = 101.11110, 3°59, = 
1001.1102, $ 9001 = 10001.020, $ 30009 = 100000.91, $ 900319 = 1000000.0; therefore 
1000000 = 1109099.1. 

(b) After calculating 577, 1.2345679 = 1224782.1, (14) tries to take the square 
root of —.0053187053. But (15) and (16) are exact in this case. [If, however, rz, = 
1+|(k—1)/2]1077, (15) and (16) have errors of order n. See Chan and Lewis, CACM 22 
(1979), 526-531, for further results on the accuracy of standard deviation calculations. | 

(c) We need to show that $u \oplus ((v \ominus u) \oslash k)$ lies between $u$ and $v$; see exercise 15.


17. FCMP  STJ  9F            Floating point comparison subroutine:
          JOV  OFLO          Ensure that overflow is off.
          STA  TEMP
          LDAN TEMP          v ← −v.
          (Copy here lines 07-20 of Program 4.2.1A.)
          LDX  FV(0:0)       Set rX to zero with the sign of f_v.
          DEC1 5
          J1N  *+2
          ENT1 0             Replace large difference in exponents
          SRAX 5,1             by a smaller one.
          ADD  FU            rA ← difference of operands.
          JOV  7F            Fraction overflow: not ~.
          CMPA EPSILON(1:5)
          JG   8F            Jump if not ~.
          JL   6F            Jump if ~.
          JXZ  9F            Jump if ~.
          JXP  1F            If |rA| = ε, check sign of rA × rX.
          JAP  9F            Jump if ~. (rA ≠ 0)
          JMP  8F
    7H    ENTX 1
          SRC  1             Make rA nonzero with same sign.
          JMP  8F
    1H    JAP  8F            Jump if not ~. (rA ≠ 0)
    6H    ENTA 0
    8H    CMPA =0=           Set comparison indicator.
    9H    JMP  *             Exit from subroutine. ∎


19. Let yk Ok Nk Ok 0 for k > n. It suffices to find the coefficient of x1, 
since the coefficient of x, will be just the same except with all subscripts increased 
by &—1. Let (fx, gk) denote the coefficient of xı in (Sk — Ck, Ck) respectively. Then 
fi = (14m) (1-1-9161 —y101 - 6101-16101), gr = (1461) (1471) (1 +01+7101), and 
fe = (1—Yeon— Spon —YeOKOK) fe—1 + (Ve—Nh+YVROKL VENA VORNEA VENT +ORNKORG 
YkÔkNkOk)Jk—1,; Jk = Oh(1+Ye)(1+ ôk) fr-1 — (1 +54) (Ye + Vee + NOK A Vek) Gk—1; 
for 1 < k < n. Thus fn = 1 + m — yı + (4n terms of 2nd order) + (higher order 
terms) = 1 + m — yı + O(ne’) is sufficiently small. [The Kahan summation formula 
was first published in CACM 8 (1965), 40; see also Proc. IFIP Congress (1971), 2, 
1232, and further developments by K. Ozawa, J. Information Proc. 6 (1983), 226-230. 
Kahan observed that sn ©cn = Jz (1+ġr)£r where |ġk| < 2e+O((n+1—k)e?). For 
another approach to accurate summation, see R. J. Hanson, CACM 18 (1975), 57-58. 
When some $x$'s are negative and others are positive, we may be able to match them
advantageously, as explained by T. O. Espelid, SIAM Review 37 (1995), 603-607. See
also G. Bohlender, IEEE Trans. C-26 (1977), 621-632, for algorithms that compute
round$(x_1 + \cdots + x_n)$ and round$(x_1 \ldots x_n)$ exactly, given $\{x_1, \ldots, x_n\}$.]
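
For orientation, the Kahan summation formula analyzed in answer 19 can be sketched in Python. The sketch assumes the usual form of the recurrence, $y_k = x_k \ominus c_{k-1}$, $s_k = s_{k-1} \oplus y_k$, $c_k = (s_k \ominus s_{k-1}) \ominus y_k$, with the final result $s_n \ominus c_n$; the function names are hypothetical, and `math.fsum` serves only as an accurate reference.

```python
import math


def naive_sum(xs):
    """Plain left-to-right floating point summation."""
    s = 0.0
    for x in xs:
        s += x
    return s


def kahan_sum(xs):
    """Compensated summation: c carries the rounding error of the last addition."""
    s = 0.0
    c = 0.0
    for x in xs:
        y = x - c            # apply the correction from the previous step
        t = s + y            # new partial sum, rounded
        c = (t - s) - y      # the part of y that was lost in forming t
        s = t
    return s - c


reference_sum = math.fsum    # accurate reference value for comparison
```

Summing ten thousand copies of 0.1 shows the difference: the naive total drifts visibly away from the reference, while the compensated total stays within a few units in the last place.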


20. By the proof of Theorem C, (47) fails for ew = p only if v| + 4 > |w— ul > 
bP} + b7}; hence |ful > |fo] > 1— (4b—1)b-”. We now find that a necessary 
and sufficient condition for failure is that |fw| is essentially rounded to 2 during the 
normalization process (actually to 2/b after scaling right for fraction overflow) —a very 
rare case indeed! 


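21. (Solution by G. W. Veltkamp.) Let $c = 2^{\lceil p/2\rceil} + 1$; we may assume that $p \ge 2$,
so $c$ is representable. First compute $u' = u \otimes c$, $u_1 = (u \ominus u') \oplus u'$, $u_2 = u \ominus u_1$;
similarly, $v' = v \otimes c$, $v_1 = (v \ominus v') \oplus v'$, $v_2 = v \ominus v_1$. Then set $w \leftarrow u \otimes v$, $w' \leftarrow
(((u_1 \otimes v_1 \ominus w) \oplus (u_1 \otimes v_2)) \oplus (u_2 \otimes v_1)) \oplus (u_2 \otimes v_2)$.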

It suffices to prove this when $u, v > 0$ and $e_u = e_v = p$, so that $u$ and $v$ are
integers $\in [2^{p-1}\,..\,2^p)$. Then $u = u_1 + u_2$ where $2^{p-1} \le u_1 \le 2^p$, $u_1 \bmod 2^{\lceil p/2\rceil} = 0$,
and $|u_2| \le 2^{\lceil p/2\rceil-1}$; similarly $v = v_1 + v_2$. The operations during the calculation of $w'$
are exact, because $w - u_1v_1$ is a multiple of $2^{p-1}$ such that $|w - u_1v_1| \le |w - uv| +
|u_2v_1 + u_1v_2 + u_2v_2| \le 2^{p-1} + 2^{p+\lceil p/2\rceil} + 2^{p-1}$; and similarly $|w - u_1v_1 - u_1v_2| \le
|w - uv| + |u_2v| < 2^{p-1} + 2^{\lceil p/2\rceil-1+p}$, where $w - u_1v_1 - u_1v_2$ is a multiple of $2^{\lceil p/2\rceil}$.
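
Veltkamp's construction can be tried in double precision, where $p = 53$ and hence $c = 2^{27} + 1$. The Python sketch below assumes IEEE round-to-nearest arithmetic and operands far from the overflow and underflow thresholds; the function names are hypothetical.

```python
from fractions import Fraction

SPLITTER = 2.0 ** 27 + 1.0      # c = 2^ceil(p/2) + 1 for p = 53


def split(u):
    """Return (u1, u2) with u == u1 + u2, where u1 keeps the high-order half of u."""
    up = u * SPLITTER           # u' = u (x) c
    u1 = (u - up) + up          # u1 = (u (-) u') (+) u'
    return u1, u - u1           # u2 = u (-) u1


def exact_product(u, v):
    """Return (w, w2) with w = u (x) v and w + w2 equal to the exact product u*v."""
    u1, u2 = split(u)
    v1, v2 = split(v)
    w = u * v
    w2 = (((u1 * v1 - w) + u1 * v2) + u2 * v1) + u2 * v2
    return w, w2


def is_exact(u, v):
    """Check w + w2 == u*v in exact rational arithmetic."""
    w, w2 = exact_product(u, v)
    return Fraction(w) + Fraction(w2) == Fraction(u) * Fraction(v)
```

The helper `is_exact` verifies the claim with exact rationals; it is there only for checking and is not part of the method.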


22. We may assume that $b^{p-1} \le u, v < b^p$. If $uv \le b^{2p-1}$, then $x_1 = uv - r$ where
$|r| \le \frac12b^{p-1}$, hence $x_2 = $ round$(u - r/v) = x_0$ (since $|r/v| \le \frac12b^{p-1}/b^{p-1} \le \frac12$, and
equality implies $v = b^{p-1}$ hence $r = 0$). If $uv > b^{2p-1}$, then $x_1 = uv - r$ where
$|r| \le \frac12b^p$, hence $x_1/v = u - r/v < b^p + \frac12b$ and $x_2 \le b^p$. If $x_2 = b^p$, then $x_3 = x_1$
(since the condition $(b^p - \frac12)v \le x_1$ implies that $x_1$ is a multiple of $b^p$, and we have
$x_1 < b^p(v + \frac12)$). If $x_2 < b^p$ and $x_1 > b^{2p-1}$, then let $x_2 = x_1/v + q$ where $|q| \le \frac12$; we
have $x_3 = $ round$(x_1 + qv) = x_1$. Finally if $x_2 < b^p$, $x_1 = b^{2p-1}$, and $x_3 < b^{2p-1}$, then
$x_4 = x_2$ by the first case above. This situation arises, for example, when $b = 10$, $p = 2$,
$u = 19$, $v = 55$, $x_1 = 1000$, $x_2 = 18$, $x_3 = 990$.
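
The closing example can be replayed with Python's `decimal` module set to two significant digits. The sketch assumes the sequence $x_0 = u$, $x_{2n+1} = x_{2n} \otimes v$, $x_{2n+2} = x_{2n+1} \oslash v$ implied by the answer; the function name is hypothetical.

```python
from decimal import Context, Decimal

ctx = Context(prec=2)   # floating decimal arithmetic with b = 10, p = 2


def multiply_divide_sequence(u, v, steps):
    """Return [x_0, x_1, ..., x_steps], alternately multiplying and dividing by v."""
    xs = [Decimal(u)]
    for i in range(steps):
        if i % 2 == 0:
            xs.append(ctx.multiply(xs[-1], Decimal(v)))   # odd index: x (x) v
        else:
            xs.append(ctx.divide(xs[-1], Decimal(v)))     # even index: x (/) v
    return xs
```

Calling `multiply_divide_sequence(19, 55, 4)` produces 19, 1000, 18, 990, 18, so the values settle down from $x_2$ onward even though $x_2 \ne x_0$.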


23. If $u \ge 0$ or $u \le -1$ we have $u \mathbin{(\mathrm{mod})} 1 = u \bmod 1$, so the identity holds. If
$-1 < u < 0$, then $u \mathbin{(\mathrm{mod})} 1 = u \oplus 1 = u + 1 + r$ where $|r| \le \frac12b^{-p}$; the identity holds if
and only if round$(1 + r) = 1$, so it always holds if we round to even. With the text's
rounding rule the identity fails if and only if $b$ is a multiple of 4 and $-1 < u < 0$ and
$u \bmod 2b^{-p} = \frac32b^{-p}$ (for example, $p = 3$, $b = 8$, $u = -(.0124)_8$).


24. Letu = [u .. ur], v = [vi .. vr]. Then upv = [ju Yv.. urAvr], where rAy = yAz, 
x A +0 = z for all z, x A —0 = z for all x 4 +0, x A +œ = +o for all z 4 —o0, 
and x A —oo needn’t be defined; x Y y = —((—x) A (—y)). If x @ y would overflow 
in normal floating point arithmetic because x + y is too large, then x A y is +oo and 
x Y y is the largest representable number. 

For subtraction, let u © v = u@ (—v), where —v = [—v,.. =v]. 

Multiplication is somewhat more complicated. The correct procedure is to let 
uv = |min(u, Y v1, WY Vr, Ur Y Vi, Ur Y Vr) .. Max(u Avi, U A Vr, Ur AVY, Ur Avr), 
where z A y = y A x, z A (~y) = —(x Y y) = (~r) Ay; x A +0 = (+0 for x > 0, 
—0 for x < 0); x A —0 = —(x A +0); z A +00 = (+00 for x > +0, —oo for x < —0). 
(It is possible to determine the min and max simply by looking at the signs of w, ur, 
vı, and vr, thereby computing only two of the eight products, except when w < 0 < ur 
and vı < 0 < vr; in the latter case we compute four products, and the answer is 


[min(u, Y vr, Ur Y w) .. max(u; A vw, ur A v,)].) 
Finally, $u \oslash v$ is undefined if $v_l < 0 < v_r$; otherwise we use the formulas for
multiplication with v; and vr replaced respectively by v7! and vyt, where z Ay! = 
rAy, «Vy = r Yy y, (+0) = +00, (+00)™t = +0. 

[See E. R. Hansen, Math. Comp. 22 (1968), 374-384. An alternative scheme, in 
which division by 0 gives no error messages and intervals may be neighborhoods of 00, 
has been proposed by W. M. Kahan. In Kahan’s scheme, for example, the reciprocal 
of [-1..+1] is [+1..—1], and an attempt to multiply an interval containing 0 by 
an interval containing oo yields [—oo..+o0], the set of all numbers. See Numerical 
Analysis, Univ. Michigan Engineering Summer Conf. Notes No. 6818 (1968).] 


25. Cancellation reveals previous errors in the computation of $u$ and $v$. For example,
if $\epsilon$ is small, we often get poor accuracy when computing $f(x + \epsilon) \ominus f(x)$, because
the rounded calculation of $f(x + \epsilon)$ destroys much of the information about $\epsilon$. It is
desirable to rewrite such formulas as $\epsilon \otimes g(x, \epsilon)$, where $g(x, \epsilon) = (f(x + \epsilon) - f(x))/\epsilon$ is
first computed symbolically. Thus, if $f(x) = x^2$ then $g(x, \epsilon) = 2x + \epsilon$; if $f(x) = \sqrt{x}$
then $g(x, \epsilon) = 1/(\sqrt{x + \epsilon} + \sqrt{x})$.

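
The square root case makes a convenient experiment. The Python sketch below compares the direct difference with the rewritten form $\epsilon \otimes g(x, \epsilon)$ in double precision; the function names are hypothetical, and the 50-digit `decimal` computation is used only as a reference value.

```python
import math
from decimal import Context, Decimal


def naive_diff(x, eps):
    """sqrt(x + eps) - sqrt(x), computed directly (suffers from cancellation)."""
    return math.sqrt(x + eps) - math.sqrt(x)


def stable_diff(x, eps):
    """eps * g(x, eps) with g(x, eps) = 1/(sqrt(x + eps) + sqrt(x))."""
    return eps / (math.sqrt(x + eps) + math.sqrt(x))


def reference_diff(x, eps):
    """The same difference evaluated with 50 significant decimal digits."""
    ctx = Context(prec=50)
    hi = ctx.sqrt(ctx.add(Decimal(x), Decimal(eps)))
    lo = ctx.sqrt(Decimal(x))
    return float(ctx.subtract(hi, lo))


def relative_error(approx, exact):
    return abs(approx - exact) / abs(exact)
```

With `x = 1.0` and `eps = 1e-12`, the direct difference keeps only a few correct digits, whereas the rewritten form agrees with the reference to nearly full double precision.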
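26. Let $e = \max(e_u, e_{u'})$, $e' = \max(e_v, e_{v'})$, $e'' = \max(e_{u\oplus v}, e_{u'\oplus v'})$, and assume that
$q = 0$. Then $(u \oplus v) - (u' \oplus v') \le u + v + \frac12b^{e''-p} - u' - v' + \frac12b^{e''-p} \le \epsilon b^e + \epsilon b^{e'} + b^{e''-p}$,
and $e'' \ge \max(e, e')$. Hence $u \oplus v \sim u' \oplus v'$ $(2\epsilon + b^{-p})$.

If $b = 2$ this estimate can be improved to $1.5\epsilon + b^{-p}$. For $\epsilon + b^{-p}$ is an upper
bound if $u - u'$ and $v - v'$ have opposite signs, and in the other case we cannot have
$e = e' = e''$.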


27. The stated identity is a consequence of the fact that $1 \oslash (1 \oslash u) = u$ whenever
$b^{-1} \le f_u \le b^{-1/2}$. If the latter were false, there would be integers $x$ and $y$ such that
$b^{p-1} < x < b^{p-1/2}$ and either $y - \frac12 \le b^{2p-1}/x < b^{2p-1}/(x - \frac12) \le y$ or $y \le b^{2p-1}/(x + \frac12) <
b^{2p-1}/x \le y + \frac12$. But that is clearly impossible unless we have $x(x + \frac12) > b^{2p-1}$, yet
the latter condition implies $y = \lfloor b^{p-1/2}\rfloor = x$.

28. See Math. Comp. 32 (1978), 227-232. 

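29. When $b = 2$ and $p = 1$ and $x > 0$, we have round$(x) = 2^{e(x)}$ where $e(x) = \lfloor\lg\frac43x\rfloor$.
Let $f(x) = x^\alpha$ and let $t(n) = \lfloor\lfloor\alpha n + \lg\frac43\rfloor/\alpha + \lg\frac43\rfloor$. Then $h(2^e) = 2^{t(e)}$. When $\alpha = .99$
we find $h(2^e) = 2^{e-1}$ for $41 < e \le 58$.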
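
The range $41 < e \le 58$ can be confirmed numerically from the formula for $t(n)$. A small Python sketch (the function name is hypothetical; double precision is assumed adequate because no argument of a floor comes close to an integer in the range tested):

```python
import math

LG_4_3 = math.log2(4.0 / 3.0)   # lg(4/3), the rounding threshold for p = 1


def t(n, alpha=0.99):
    """Exponent of h(2^n): round 2^(alpha*n) to one bit, undo f, round again."""
    inner = math.floor(alpha * n + LG_4_3)      # exponent of round(f(2^n))
    return math.floor(inner / alpha + LG_4_3)   # exponent after inverting and rounding


shifted = [e for e in range(1, 100) if t(e) == e - 1]
```

The list `shifted` contains exactly the exponents 42 through 58; for the other exponents below 100 the round trip returns $2^e$ unchanged.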

31. According to the theory in Section 4.5.3, the convergents to the continued fraction
$\sqrt3 = 1 + /\!/1, 2, 1, 2, \ldots/\!/$ are $p_n/q_n = K_{n+1}(1, 1, 2, 1, 2, \ldots)/K_n(1, 2, 1, 2, \ldots)$.
These convergents are excellent approximations to $\sqrt3$, hence $3q_n^2 \approx p_n^2$; in fact,
$3q_n^2 - p_n^2 = 2 - 3(n \bmod 2)$. The example given is $2p_{31}^2 + (3q_{31}^2 - p_{31}^2)(3q_{31}^2 + p_{31}^2) =
2p_{31}^2 - (p_{31}^2 - 1 + p_{31}^2) = 1$. Floating point subtraction of $p_{31}^2$ from $3q_{31}^2$ yields zero,
unless we can represent $3q_{31}^2$ almost perfectly; subtracting $p_{31}^4$ from $9q_{31}^4$ generally gives
rounding errors much larger than $2p_{31}^2$. Similar examples can be based on continued
fraction approximations to any algebraic number.
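
The identities in answer 31 are easy to confirm with exact integers, and the same expression evaluated in double precision shows the loss of accuracy. In the Python sketch below the function names are hypothetical, and the convergents are generated by the standard recurrence for the partial quotients 1, 2, 1, 2, ....

```python
def sqrt3_convergent(n):
    """Return (p_n, q_n), the nth convergent of sqrt(3) = 1 + //1, 2, 1, 2, ...//."""
    p_prev, q_prev = 1, 0
    p, q = 1, 1                          # p_0/q_0 = 1
    for k in range(1, n + 1):
        a = 1 if k % 2 == 1 else 2       # partial quotients 1, 2, 1, 2, ...
        p_prev, p = p, a * p + p_prev
        q_prev, q = q, a * q + q_prev
    return p, q


def exact_value(p, q):
    """2p^2 + 9q^4 - p^4 in exact integer arithmetic."""
    return 2 * p * p + 9 * q ** 4 - p ** 4


def float_value(p, q):
    """The same expression with every operation rounded to double precision."""
    x, y = float(q), float(p)
    return 2.0 * y * y + 9.0 * x * x * x * x - y * y * y * y
```

For $n = 31$ the exact value is 1, while the floating point evaluation cannot return 1, because $9q^4$ and $p^4$ are each rounded to a spacing far larger than the true answer.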


32. (J. Ziegler Hunts, 2014.) a = 1/2 and b mod 1 = 1/4. 


SECTION 4.2.3 
1. First, $(w_m, w_l) = (.573, .248)$; then $w_mv_l/v_m = .290$; so the answer is (.572, .958).
This in fact is the correct result to six decimals.


2. The answer is not affected, since the normalization routine truncates to eight 
places and can never look at this particular byte position. (Scaling to the left occurs 
at most once during normalization, since the inputs are normalized.) 
3. Overflow obviously cannot occur at line 09, since we are adding two-byte quantities,
or at line 22, since we are adding four-byte quantities. In line 30 we are computing the
sum of three four-byte quantities, so this cannot overflow. Finally, in line 32, overflow
is impossible because the product $f_uf_v$ must be less than unity.


4. Insert ‘JOV OFLO; ENT1 0’ between lines 03 and 04. Also replace lines 21-22 by 
‘ADD TEMP(ABS); JNOV *+2; INC1 1’, and change lines 28-31 to ‘SLAX 5; ADD TEMP; 
JNOV *+2; INC1 1; ENTX 0,1; SRC 5’. This adds five lines of code and only 1, 2, or 3 
units of execution time. 


5. Insert ‘JOV OFLO’ after line 06. Change lines 23, 31, 39 respectively to ‘SRAX 0,1’, 
‘SLAX 5’, ‘ADD ACC’. Between lines 40 and 41, insert ‘DEC2 1; JNOV DNORM; INC2 1; 
INCX 1; SRC 1’. (It’s tempting to remove the ‘DEC2 1’ in favor of ‘STZ EXPO’, but then 
‘INC2 1’ might overflow rI2!) This adds six lines of code; the running time decreases 
by 3u, unless there is fraction overflow, when it increases by 7u.


6. DOUBLE STJ  EXITDF        Convert to double precision:
          ENTX 0             Clear rX.
          STA  TEMP
          LD2  TEMP(EXP)     rI2 ← e.
          INC2 QQ-Q          Correct for difference in excess.
          STZ  EXPO          EXPO ← 0.
          SLAX 1             Remove exponent.
          JMP  DNORM         Normalize and exit.
   SINGLE STJ  EXITF         Convert to single precision:
          JOV  OFLO          Ensure that overflow is off.
          STA  TEMP
          LD2  TEMP(EXPD)    rI2 ← e.
          DEC2 QQ-Q          Correct for difference in excess.
          SLAX 2             Remove exponent.
          JMP  NORM          Normalize, round, and exit. ∎


7. All three routines give zero as the answer if and only if the exact result would
be zero, so we need not worry about zero denominators in the expressions for relative
error. The worst case of the addition routine is pretty bad: Visualized in decimal
notation, if the inputs are 1.0000000 and −.99999999, the answer is $b^{-7}$ instead of $b^{-8}$;
thus the maximum relative error $\delta_1$ is $b - 1$, where $b$ is the byte size.

For multiplication and division, we may assume that both operands are positive
and have the same exponent QQ. The maximum error in multiplication is readily
bounded by considering Fig. 4: When $uv \ge 1/b$, we have $0 \le uv - u \otimes v < 3b^{-9} +
(b - 1)b^{-9}$, so the relative error is bounded by $(b + 2)b^{-8}$. When $1/b^2 \le uv < 1/b$,
we have $0 \le uv - u \otimes v < 3b^{-9}$, so the relative error in this case is bounded by
$3b^{-9}/uv \le 3b^{-7}$. We take $\delta_2$ to be the larger of the two estimates, namely $3b^{-7}$.

Division requires a more careful analysis of Program D. The quantity actually 
computed by the subroutine is a — 6 — be((a — 6”)(B — 6’) — 6”) — ôn where a = 
(Um + €uz)/bUm, B = vi/b¥m, and the nonnegative truncation errors (ô, 6’, 6”, 6’) are 
respectively less than (b~'°, b75, b75, b~®); finally ôn (the truncation during normal- 
ization) is nonnegative and less than either b~° or b78, depending on whether scaling 
occurs or not. The actual value of the quotient is a/(1 + bef) = a — beaB + b?a875"”, 
where 6”” is the nonnegative error due to truncation of the infinite series (2); here 
6" < & = b7", since it is an alternating series. The relative error is therefore the 
absolute value of (bed’ + bed"8/a + bed” /a) — (5/a + bed'5"/a + b? B75" + bn /a), times 
(1 + be8). The positive terms in this expression are bounded by b7? + b78 + b78, 
and the negative terms are bounded by b78 + b~'* + b78 plus the contribution by the 
normalizing phase, which can be about b~” in magnitude. It is therefore clear that the 
potentially greatest part of the relative error comes during the normalization phase, 
and that 53 = (b+ 2)b~® is a safe upper bound for the relative error. 


8. Addition: If $e_u \le e_v + 1$, the entire relative error occurs during the normalization
phase, so it is bounded above by $b^{-7}$. If $e_u \ge e_v + 2$, and if the signs are the same, again
the entire error may be ascribed to normalization; if the signs are opposite, the error
due to shifting digits out of the register is in the opposite direction from the subsequent
error introduced during normalization. Both of these errors are bounded by $b^{-7}$, hence
$\delta_1 = b^{-7}$. (This is substantially better than the result in exercise 7.)

Multiplication: An analysis as in exercise 7 gives d2 = (b+ 2)b-°. 


SECTION 4.2.4 


1. Since fraction overflow can occur only when the operands have the same sign,
this is the probability that fraction overflow occurs divided by the probability that the
operands have the same sign, namely, $7\%/(\frac12(91\%)) \approx 15\%$.


3. $\log_{10} 2.4 - \log_{10} 2.3 \approx 1.84834\%$.

4. The pages would be uniformly gray. 


5. The probability that $10f_u \le r$ is $(r - 1)/10 + (r - 1)/100 + \cdots = (r - 1)/9$. So
in this case the leading digits are uniformly distributed; for example, the leading digit
is 1 with probability $\frac19$.


6. The probability that there are three leading zero bits is $\log_{16} 2 = \frac14$; the probability
that there are two leading zero bits is $\log_{16} 4 - \log_{16} 2 = \frac14$; and similarly for the other
two cases. The "average" number of leading zero bits is $1\frac12$, so the "average" number of
"significant bits" is $p + \frac12$. The worst case, $p - 1$ bits, occurs however with rather high
probability. In practice, it is usually necessary to base error estimates on the worst case,
since a chain of calculations is only as strong as its weakest link. In the error analysis of
Section 4.2.2, the upper bound on relative rounding error for floating hex is $2^{1-p}$. In the
binary case we can have $p + 1$ significant bits in all normalized numbers (see exercise
4.2.1-3), with relative rounding errors bounded by $2^{-1-p}$. Extensive computational
experience confirms that floating binary produces significantly more accurate results
than the equivalent floating hex, even when the binary numbers have a precision of
$p$ bits instead of $p + 1$.

Tables 1 and 2 show that hexadecimal arithmetic can be done a little faster, since 
fewer cycles are needed when scaling to the right or normalizing to the left. But this 
fact is insignificant compared to the substantial advantages of b = 2 over other radices 
(see also Theorem 4.2.2C and exercises 4.2.2-13, 15, 21), especially since floating binary 
can be made as fast as floating hex with only a tiny increase in total processor cost. 


7. For example, suppose that $\sum_m (F(10^{km}\cdot 5^k) - F(10^{km})) = \log 5^k/\log 10^k$ and also 
that $\sum_m (F(10^{km}\cdot 4^k) - F(10^{km})) = \log 4^k/\log 10^k$; then 

$$\sum_m \bigl(F(10^{km}\cdot 5^k) - F(10^{km}\cdot 4^k)\bigr) = \log_{10}\tfrac54$$

for all k. But now let $\epsilon$ be a small positive number, and choose $\delta > 0$ so that $F(x) < \epsilon$ 
for $0 < x < \delta$, and choose $M > 0$ so that $F(x) > 1 - \epsilon$ for $x > M$. We can take k so 
large that $10^{-k}\cdot 5^k < \delta$ and $4^k > M$; hence by the monotonicity of F, 

$$\sum_m \bigl(F(10^{km}\cdot 5^k) - F(10^{km}\cdot 4^k)\bigr) 
\le \sum_{m<0} \bigl(F(10^{km}\cdot 5^k) - F(10^{k(m-1)}\cdot 5^k)\bigr) + \sum_{m\ge 0} \bigl(F(10^{k(m+1)}\cdot 4^k) - F(10^{km}\cdot 4^k)\bigr) 
= F(10^{-k}\cdot 5^k) + 1 - F(4^k) < 2\epsilon.$$


8. When $s > r$, $P_0(10^n s)$ is 1 for small n, and 0 when $\lfloor 10^n s\rfloor > \lfloor 10^n r\rfloor$. The least n 
for which this happens may be arbitrarily large, so no uniform bound can be given 
for $N_0(\epsilon)$ independent of s. (In general, calculus textbooks prove that such a uniform 
bound would imply that the limit function $S_0(s)$ would be continuous, and it isn’t.) 

9. Let $q_1, q_2, \ldots$ be such that $P_0(n) = q_1\binom{n-1}{0} + q_2\binom{n-1}{1} + \cdots$ for all n. It follows 
that $P_m(n) = 1^{-m}q_1\binom{n-1}{0} + 2^{-m}q_2\binom{n-1}{1} + \cdots$ for all m and n. 


10. When $1 < r < 10$ the generating function $C(z)$ has simple poles at the points 
$1 + w_n$, where $w_n = 2\pi ni/\ln 10$, hence 


—wnlnr _ 1 


lo r—1 1+ wn e 
C(z) = “B12 >a | B(2) 


beer (In 10)(z — 1 — wn) 
where F(z) is analytic in the entire plane. Thus if $\theta = \arctan(2\pi/\ln 10)$, 
2 eo tn lnr 1 
c 810 In 10 3 Wn(1 + wn)™ e 


sin(mé@ + 27 logio r) —sin(mé) | 1 
1+ m/2 + O m/2 
m(1 + 4r?/(ln 10)2) (1 + 1672/(In 10)2) 


= logjor 


11. When $(\log_b U) \bmod 1$ is uniformly distributed in [0..1), so is $(\log_b 1/U) \bmod 1 = 
(1 - \log_b U) \bmod 1$. 


12. We have 
zj= ° x) dx g(z/bx)/ba + i x) dx g(z/x)/zx; 
) [fo g(z/bx)/ f rœ g(z/x)/ 


consequently 


h(z)—Ue) F 4, pa Ml) a(z/2) — Uz/2). 
Ta = f foe Taa JEO MES T 


Since f(x) > 0, |(h(z) — U(z))/U(z)| < Sin fe x) dx A(g) + f f(x)dx A(g) for all z, 
hence $A(h) \le A(g)$. By symmetry, $A(h) \le A(f)$. [Bell System Tech. J. 49 (1970), 
1609-1625.] 


13. Let $X = (\log_b U) \bmod 1$ and $Y = (\log_b V) \bmod 1$, so that X and Y are 
independently and uniformly distributed in [0..1). No left shift is needed if and only if 
$X + Y \ge 1$, and that occurs with probability 1/2. 

(Similarly, the probability is 1/2 that floating point division by Algorithm 4.2.1M 
needs no normalization shifts; this analysis needs only the weaker assumption that both 
of the operands independently have the same distribution.) 
14. For convenience, the calculations are shown here for b = 10. If k = 0, the 
probability of a carry is 

$$\Bigl(\frac{1}{\ln 10}\Bigr)^{2} \iint_{\substack{1\le x,y\le 10\\ x+y\ge 10}} \frac{dx}{x}\,\frac{dy}{y}.$$

(See Fig. A-7.) The value of the integral is 

$$\int_0^{10}\frac{dy}{y}\int_{10-y}^{10}\frac{dx}{x} \;-\; 2\int_0^{1}\frac{dy}{y}\int_{10-y}^{10}\frac{dx}{x},$$

and 

$$\int_0^{t}\frac{dy}{y}\int_{1-y/10}^{1}\frac{dx}{x} = \int_0^{t}\Bigl(\frac{y}{10}+\frac{y^2}{200}+\frac{y^3}{3000}+\cdots\Bigr)\frac{dy}{y} = \frac{t}{10}+\frac{t^2}{400}+\frac{t^3}{9000}+\cdots.$$

(The latter integral is essentially a “dilogarithm.”) Hence the probability of a carry 
when k = 0 is $(1/\ln 10)^2(\pi^2/6 - 2\sum_{n\ge1} 1/n^2 10^n) \approx .27154$. [Note: When b = 2 and 
k = 0, fraction overflow always occurs, so this derivation proves that $\sum_{n\ge1} 1/n^2 2^n = 
\pi^2/12 - (\ln 2)^2/2$.] 

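When k > 0, the probability is 

$$\Bigl(\frac{1}{\ln 10}\Bigr)^{2}\int_{10^{-k}}^{10^{1-k}}\frac{dy}{y}\int_{10-y}^{10}\frac{dx}{x} = \Bigl(\frac{1}{\ln 10}\Bigr)^{2}\Bigl(\sum_{n\ge1}\frac{1}{n^2 10^{nk}} - \sum_{n\ge1}\frac{1}{n^2 10^{n(k+1)}}\Bigr).$$

Thus when b = 10, fraction overflow should occur with approximate probability $.272p_0 + 
.017p_1 + .002p_2 + \cdots$. When b = 2 the corresponding figures are $p_0 + .655p_1 + .288p_2 + 
.137p_3 + .067p_4 + .033p_5 + .016p_6 + .008p_7 + .004p_8 + .002p_9 + .001p_{10} + \cdots$. 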

Now if we use the probabilities from Table 1, dividing by .91 to eliminate zero 
operands and assuming that the probabilities are independent of the operand signs, we 
predict a probability of about 14 percent when b = 10, instead of the 15 percent in 
exercise 1. For b = 2, we predict about 48 percent, while the table yields 44 percent. 
These results are certainly in agreement within the limits of experimental error. 
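
The coefficients quoted in this answer can be checked numerically. The following Python sketch assumes the closed forms stated above (a dilogarithm-type series divided by $(\ln b)^2$); the function names `dilog` and `carry_probability` are illustrative and do not come from the text.

```python
import math

def dilog(x, terms=200):
    """Partial sum of sum_{n>=1} x^n / n^2, adequate for 0 <= x <= 1/2."""
    return sum(x**n / n**2 for n in range(1, terms + 1))

def carry_probability(b, k):
    """Probability of fraction overflow when the exponents differ by k,
    assuming both fraction parts follow the logarithmic law."""
    ln_b_squared = math.log(b) ** 2
    if k == 0:
        return (math.pi**2 / 6 - 2 * dilog(1.0 / b)) / ln_b_squared
    return (dilog(float(b) ** -k) - dilog(float(b) ** -(k + 1))) / ln_b_squared

if __name__ == "__main__":
    for b in (10, 2):
        print(b, [round(carry_probability(b, k), 3) for k in range(4)])
```

For b = 2 and k = 0 the value should come out as 1 up to rounding, which is the identity mentioned in the note.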


15. When k = 0, the leading digit is 1 if and only if there is a carry. (It is possible 
for fraction overflow and subsequent rounding to yield a leading digit of 2, when b > 4, 
but we are ignoring rounding in this exercise.) The probability of fraction overflow is 
approximately .272, as shown in the previous exercise, and $.272 < \log_{10} 2$. 

When k > 0, the leading digit is 1 with probability 

$$\Bigl(\frac{1}{\ln 10}\Bigr)^{2}\int_{10^{-k}}^{10^{1-k}}\frac{dy}{y}\int_{\substack{1\le x<2-y\\ \text{or } 10-y\le x<10}}\frac{dx}{x} \;<\; \Bigl(\frac{1}{\ln 10}\Bigr)^{2}\int_{10^{-k}}^{10^{1-k}}\frac{dy}{y}\int_{1\le x\le 2}\frac{dx}{x} = \log_{10} 2.$$


16. To prove the hint [which is due to Landau, Prace Matematyczno-Fizyczne 21 
(1910), 103-113], assume first that $\limsup a_n = \lambda > 0$. Let $\epsilon = \lambda/(\lambda + 4M)$ and choose 
N so that $|a_1 + \cdots + a_n| < \frac1{10}\epsilon\lambda n$ for all $n > N$. Let $n > N/(1-\epsilon)$, $n > 5/\epsilon$ be such 
that $a_n > \frac12\lambda$. Then, by induction, $a_{n-k} \ge a_n - kM/(n - \epsilon n) > \frac14\lambda$ for $0 \le k < \epsilon n$, 
and $\sum_{n-\epsilon n<k\le n} a_k \ge \frac14\lambda(\epsilon n - 1) > \frac15\lambda\epsilon n$. But 

$$\Bigl|\sum_{n-\epsilon n<k\le n} a_k\Bigr| = \Bigl|\sum_{1\le k\le n} a_k - \sum_{1\le k\le n-\epsilon n} a_k\Bigr| \le \tfrac15\epsilon\lambda n$$

since $n - \epsilon n > N$. A similar contradiction applies if $\liminf a_n < 0$. 
Assuming that $P_{m+1}(n) \to \lambda$ as $n \to \infty$, let $a_k = P_m(k) - \lambda$. If $m > 0$, the $a_k$ 
satisfy the hypotheses of the hint (see Eq. 4.2.2-(15)), since $0 \le P_m(k) \le 1$; hence 
$P_m(n) \to \lambda$. 

17. See J. Math. Soc. Japan 4 (1952), 313-322. (The fact that harmonic 
probability extends ordinary probability follows from a theorem of Cesàro, [Atti della 
Reale Accademia dei Lincei, Rendiconti (4) 4 (1888), 452-457]. Persi Diaconis [Ph.D. 
thesis, Harvard University, 1974] has shown among other things that the definition of 
probability by repeated averaging is weaker than harmonic probability, in the following 
precise sense: If $\lim_{m\to\infty}\liminf_{n\to\infty} P_m(n) = \lim_{m\to\infty}\limsup_{n\to\infty} P_m(n) = \lambda$ then 
the harmonic probability is $\lambda$. On the other hand the statement “$10^{k^2} \le n < 10^{k^2+k}$ 
for some integer $k > 0$” has harmonic probability $\frac12$, while repeated averaging never 
settles down to give it any particular probability.) 

18. Let $p(a) = P(L_a)$ and $p(a, b) = \sum_{a\le k<b} p(k)$ for $1 \le a < b$. Since $L_a = L_{10a} \cup 
L_{10a+1} \cup \cdots \cup L_{10a+9}$ for all a, we have $p(a) = p(10a, 10(a+1))$ by (i). Furthermore 
since $P(S) = P(2S) + P(2S+1)$ by (i), (ii), (iii), we have $p(a) = p(2a, 2(a+1))$. It 
follows that $p(a, b) = p(2^m 10^n a, 2^m 10^n b)$ for all $m, n \ge 0$. 

If $1 < b/a < b'/a'$, then $p(a, b) \le p(a', b')$. The reason is that there exist integers 
$m$, $n$, $m'$, $n'$ such that $2^{m'}10^{n'}a' \le 2^m 10^n a < 2^m 10^n b \le 2^{m'}10^{n'}b'$ as a consequence 
of the fact that log 2/log 10 is irrational, hence we can apply (v). (See exercise 3.5-22 
with k = 1 and $U_n = n\log 2/\log 10$.) In particular, $p(a) \ge p(a+1)$, and it follows that 
$p(a, b)/p(a, b+1) \ge (b-a)/(b+1-a)$. (See Eq. 4.2.2-(15).) 

Now we can prove that $p(a, b) = p(a', b')$ whenever $b/a = b'/a'$; for $p(a, b) = 
p(10^n a, 10^n b) \le c_n\,p(10^n a, 10^n b - 1) \le c_n\,p(a', b')$, for arbitrarily large values of n, 
where $c_n = 10^n(b-a)/(10^n(b-a) - 1) = 1 + O(10^{-n})$. 

For any positive integer n we have $p(a^n, b^n) = p(a^n, ba^{n-1}) + p(ba^{n-1}, b^2a^{n-2}) + 
\cdots + p(b^{n-1}a, b^n) = n\,p(a, b)$. If $10^m \le a^n \le 10^{m+1}$ and $10^{m'} \le b^n \le 10^{m'+1}$, then 
$p(10^{m+1}, 10^{m'}) \le p(a^n, b^n) \le p(10^m, 10^{m'+1})$ by (v). But $p(1, 10) = 1$ by (iv), hence 
$p(10^m, 10^{m'}) = m' - m$ for all $m' \ge m$. We conclude that $\lfloor\log_{10} b^n\rfloor - \lfloor\log_{10} a^n\rfloor - 1 \le 
n\,p(a, b) \le \lfloor\log_{10} b^n\rfloor - \lfloor\log_{10} a^n\rfloor + 1$ for all n, and $p(a, b) = \log_{10}(b/a)$. 

[This exercise was inspired by D. I. A. Cohen, who proved a slightly weaker result 
in J. Combinatorial Theory A20 (1976), 367-370.] 


19. Equivalently, $\langle(\log_{10} F_n) \bmod 1\rangle$ is equidistributed in the sense of Definition 3.5B. 
Since $\log_{10} F_n = n\log_{10}\phi - \log_{10}\sqrt5 + O(\phi^{-2n})$ by 1.2.8-(14), this is equivalent to 
equidistribution of $\langle n\log_{10}\phi\rangle$, which follows from exercise 3.5-22. [Fibonacci Quarterly 5 
(1967), 137-140.] The same proof shows that the sequences $\langle b^n\rangle$ obey the logarithmic 
law for all integers $b > 1$ that aren’t powers of 10 [Yaglom and Yaglom, Challenging 
Problems with Elementary Solutions (Moscow: 1954; English translation, 1964), 
Problem 91b]. 
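
As an informal illustration of this answer (not a proof), the leading digits of the first thousand Fibonacci numbers can be tallied and compared with $\log_{10}(1 + 1/d)$. The Python sketch below is only an experiment under that assumption; the function name `leading_digit_frequencies` is made up for the example.

```python
import math

def leading_digit_frequencies(count):
    """Fraction of F_1..F_count whose decimal form starts with digit d;
    entry 0 of the result is unused and stays 0."""
    tally = [0] * 10
    a, b = 1, 1                      # F_1, F_2
    for _ in range(count):
        tally[int(str(a)[0])] += 1   # leading decimal digit of the current F_n
        a, b = b, a + b
    return [t / count for t in tally]

if __name__ == "__main__":
    freq = leading_digit_frequencies(1000)
    for d in range(1, 10):
        print(d, freq[d], round(math.log10(1 + 1 / d), 4))
```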

Notes: Many other sequences of integers have this property. For example, Persi 
Diaconis [Annals of Probability 5 (1977), 72-81] showed that $\langle n!\rangle$ is one such sequence, 
and that binomial coefficients obey the logarithmic law too, in the sense that 

$$\lim_{n\to\infty}\frac{1}{n+1}\sum_{k=0}^{n}\bigl[10f_{\binom{n}{k}} < r\bigr] = \log_{10} r.$$

P. Schatte [Math. Nachrichten 148 (1990), 137-144] proved that the denominators 
of continued fraction approximations have logarithmic fraction parts, whenever the 
partial quotients have a repeating pattern with polynomial variation as in exercise 
4.5.3-16. One interesting open question is whether the sequence $\langle 2!, (2!)!, ((2!)!)!, \ldots\rangle$ 
has logarithmic fraction parts; see J. H. Conway and M. J. T. Guy, Eureka 25 (1962), 
18-19. 


SECTION 4.3.1 


2. If the ith number to be added is $u_i = (u_{i(n-1)} \ldots u_{i1}u_{i0})_b$, use Algorithm A with 
step A2 changed to the following: 

A2'. [Add digits.] Set 

$w_j \leftarrow (u_{1j} + \cdots + u_{mj} + k) \bmod b$, and $k \leftarrow \lfloor(u_{1j} + \cdots + u_{mj} + k)/b\rfloor$. 

(The maximum value of k is m - 1, so step A3 would have to be altered if m > b.) 
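
The modified step A2' can be sketched in Python as follows. This is a minimal illustration, assuming each operand is given as a little-endian list of radix-b digits (index j holds the digit of weight $b^j$); the name `add_columns` is hypothetical.

```python
def add_columns(numbers, b):
    """Add m nonnegative n-place radix-b numbers column by column.
    Each number is a little-endian digit list of the same length n."""
    n = len(numbers[0])
    w = [0] * (n + 1)
    k = 0                                      # running carry, at most m - 1
    for j in range(n):
        total = sum(u[j] for u in numbers) + k  # step A2'
        w[j] = total % b
        k = total // b
    w[n] = k                                   # final carry; may exceed b - 1 when m > b
    return w

if __name__ == "__main__":
    print(add_columns([[9, 9], [9, 9], [9, 9]], 10))   # 99 + 99 + 99
```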


3.      ENN1 N        1 
        JOV  OFLO     1    Ensure that overflow is off. 
        ENTX 0        1    k ← 0. 
   2H   SLAX 5        N    (rX ≡ next value of k) 
        ENT3 M*N,1    N    (LOC(u_ij) ≡ U + n(i−1) + j) 
   3H   ADD  U,3      MN   rA ← rA + u_ij. 
        JNOV *+2      MN 
        INCX 1        K    Carry one. 
        DEC3 N        MN   Repeat for m ≥ i ≥ 1. 
        J3NN 3B       MN   (rI3 ≡ n(i−1) + j) 
        STA  W+N,1    N    w_j ← rA. 
        INC1 1        N 
        J1N  2B       N    Repeat for 0 ≤ j < n. 
        STX  W+N      1    Store final carry in w_n. 

Running time, assuming that $K = \frac12 MN$, is 5.5MN + 7N + 4 cycles. 

4. We may make the following assertion before A1: “$n \ge 1$; and $0 \le u_i, v_i < b$ for 
$0 \le i < n$.” Before A2, we assert: “$0 \le j < n$; $0 \le u_i, v_i < b$ for $0 \le i < n$; $0 \le w_i < b$ 
for $0 \le i < j$; $0 \le k \le 1$; and $(u_{j-1} \ldots u_0)_b + (v_{j-1} \ldots v_0)_b = (k\,w_{j-1} \ldots w_0)_b$.” The 
latter statement means more precisely that 

$$\sum_{0\le l<j} u_l b^l + \sum_{0\le l<j} v_l b^l = k b^j + \sum_{0\le l<j} w_l b^l.$$

Before A3, we assert: “$0 \le j < n$; $0 \le u_i, v_i < b$ for $0 \le i < n$; $0 \le w_i < b$ for $0 \le i \le j$; 
$0 \le k \le 1$; and $(u_j \ldots u_0)_b + (v_j \ldots v_0)_b = (k\,w_j \ldots w_0)_b$.” After A3, we assert that 
$0 \le w_i < b$ for $0 \le i < n$; $0 \le w_n \le 1$; and $(u_{n-1} \ldots u_0)_b + (v_{n-1} \ldots v_0)_b = (w_n \ldots w_0)_b$. 
It is a simple matter to complete the proof by verifying the necessary implications 
between the assertions and by showing that the algorithm always terminates. 


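5. B1. Set $j \leftarrow n - 1$, $w_n \leftarrow 0$. 

B2. Set $t \leftarrow u_j + v_j$, $w_j \leftarrow t \bmod b$, $i \leftarrow j$. 

B3. If $t \ge b$, set $i \leftarrow i + 1$, $t \leftarrow w_i + 1$, $w_i \leftarrow t \bmod b$, and repeat this step until 
$t < b$. 

B4. Decrease j by one, and if $j \ge 0$ go back to B2. 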


6. C1. Set $j \leftarrow n - 1$, $i \leftarrow n$, $r \leftarrow 0$. 

C2. Set $t \leftarrow u_j + v_j$. If $t \ge b$, set $w_i \leftarrow r + 1$ and $w_k \leftarrow 0$ for $i > k > j$; then set 
$i \leftarrow j$ and $r \leftarrow t \bmod b$. Otherwise if $t < b - 1$, set $w_i \leftarrow r$ and $w_k \leftarrow b - 1$ 
for $i > k > j$; then set $i \leftarrow j$ and $r \leftarrow t$. 

C3. Decrease j by one. If $j \ge 0$, go back to C2; otherwise set $w_i \leftarrow r$, and 
$w_k \leftarrow b - 1$ for $i > k \ge 0$. 
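
The two left-to-right procedures of answers 5 and 6 can be tried out with a short Python sketch. It assumes little-endian digit lists (index j holds the digit of weight $b^j$), so "left to right" means decreasing j; the function names are illustrative only.

```python
def add_left_to_right_b(u, v, b):
    """Steps B1-B4: add from the most significant digit, going back to fix carries."""
    n = len(u)
    w = [0] * (n + 1)                       # B1: w_n starts at 0
    for j in range(n - 1, -1, -1):
        t = u[j] + v[j]                     # B2
        w[j] = t % b
        i = j
        while t >= b:                       # B3: push the carry to the left
            i += 1
            t = w[i] + 1
            w[i] = t % b
    return w

def add_left_to_right_c(u, v, b):
    """Steps C1-C3: never revisit a stored digit; r is the pending digit for place i."""
    n = len(u)
    w = [0] * (n + 1)
    i, r = n, 0                             # C1
    for j in range(n - 1, -1, -1):
        t = u[j] + v[j]                     # C2
        if t >= b:                          # a carry resolves the pending digits
            w[i] = r + 1
            for k in range(j + 1, i):
                w[k] = 0
            i, r = j, t % b
        elif t < b - 1:                     # no carry can pass this place
            w[i] = r
            for k in range(j + 1, i):
                w[k] = b - 1
            i, r = j, t
        # when t == b - 1 the digit stays pending
    w[i] = r                                # C3: flush what is still pending
    for k in range(i):
        w[k] = b - 1
    return w

if __name__ == "__main__":
    print(add_left_to_right_b([9, 9, 4], [1, 0, 5], 10))
    print(add_left_to_right_c([9, 9, 4], [1, 0, 5], 10))
```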


7. When $j = n - 3$, for example, we have k = 0 with probability $(b+1)/2b$; k = 1 
with probability $((b-1)/2b)(1 - 1/b)$, namely the probability that a carry occurs and 
that the preceding digit wasn’t $b - 1$; k = 2 with probability $((b-1)/2b)(1/b)(1 - 1/b)$; 
and k = 3 with probability $((b-1)/2b)(1/b)(1/b)(1)$. For fixed k we may add the 
probabilities as j varies from $n - 1$ to 0; this gives the mean number of times the carry 
propagates back k places, 

$$m_k = \frac{b-1}{2b^k}\Bigl((n+1-k)\Bigl(1-\frac1b\Bigr) + \frac1b\Bigr).$$

As a check, we find that the average number of carries is 

$$m_1 + 2m_2 + \cdots + nm_n = \frac12\Bigl(n - \frac{1}{b-1}\Bigl(1 - \Bigl(\frac1b\Bigr)^{n}\Bigr)\Bigr),$$

in agreement with (6). 
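
The "check" at the end of this answer is easy to confirm with exact rational arithmetic. The Python sketch below assumes the two formulas exactly as written above; the function names are invented for the illustration.

```python
from fractions import Fraction

def mean_propagations(n, b, k):
    """m_k: mean number of times a carry propagates back exactly k places."""
    b = Fraction(b)
    return (b - 1) / (2 * b**k) * ((n + 1 - k) * (1 - 1 / b) + 1 / b)

def average_carries(n, b):
    """Right-hand side of the check: (n - (1 - b^-n)/(b - 1)) / 2."""
    b = Fraction(b)
    return Fraction(1, 2) * (n - (1 - b**-n) / (b - 1))

if __name__ == "__main__":
    n, b = 3, 2
    total = sum(k * mean_propagations(n, b, k) for k in range(1, n + 1))
    print(total, average_carries(n, b))
```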


8.      ENT1 N-1      1 
        JOV  OFLO     1 
        STZ  W+N      1 
   2H   LDA  U,1      N 
        ADD  V,1      N 
        STA  W,1      N 
        JNOV 4F       N 
        ENT2 1,1      L 
   3H   LDA  W,2      K 
        INCA 1        K 
        STA  W,2      K 
        INC2 1        K 
        JOV  3B       K 
   4H   DEC1 1        N 
        J1NN 2B       N 

The running time depends on L, the number of positions in which $u_j + v_j \ge b$; and on 
K, the total number of carries. It is not difficult to see that K is the same quantity 
that appears in Program A. The analysis in the text shows that L has the average 
value $N((b-1)/2b)$, and K has the average value $\frac12(N - b^{-1} - b^{-2} - \cdots - b^{-n})$. So if 
we ignore terms of order 1/b, the running time is $9N + L + 7K + 3 \approx 13N + 3$ cycles. 


9. Replace “b” by “b;” everywhere in step A2. 


10. If lines 06 and 07 were interchanged, we would almost always have overflow, but 
register A might have a negative value at line 08, so this would not work. If the 
instructions on lines 05 and 06 were interchanged, the sequence of overflows occurring 
in the program would be slightly different in some cases, but the program would still 
be right. 


11. This is equivalent to lexicographic comparison of strings: (i) Set $j \leftarrow n - 1$; (ii) if 
$u_j < v_j$, terminate [u < v]; if $u_j = v_j$ and j = 0, terminate [u = v]; if $u_j = v_j$ and 
j > 0, set $j \leftarrow j - 1$ and repeat (ii); if $u_j > v_j$, terminate [u > v]. This algorithm tends 
to be quite fast, since there is usually low probability that j will have to decrease very 
much before we encounter a case with $u_j \ne v_j$. 


12. Use Algorithm S with uj = 0 and vj = wj. Another borrow will occur at the end 
of the algorithm; this time it should be ignored. 
13.     ENN1 N        1 
        JOV  OFLO     1 
        ENTX 0        1 
   2H   STX  CARRY    N 
        LDA  U+N,1    N 
        MUL  V        N 
        SLC  5        N 
        ADD  CARRY    N 
        JNOV *+2      N 
        INCX 1        K 
        STA  W+N,1    N 
        INC1 1        N 
        J1N  2B       N 
        STX  W+N      1 

The running time is 23N + K + 5 cycles, and K is roughly $\frac12 N$. 


14. The key inductive assertion is the one that should be valid at the beginning of 
step M4; all others are readily filled in from this one, which is as follows: $0 \le i < m$; 
$0 \le j < n$; $0 \le u_l < b$ for $0 \le l < m$; $0 \le v_l < b$ for $0 \le l < n$; $0 \le w_l < b$ for 
$0 \le l < j + m$; $0 \le k < b$; and, in the notation of the answer to exercise 4, 

$$(w_{j+m-1} \ldots w_0)_b + k b^{i+j} = u \times (v_{j-1} \ldots v_0)_b + (u_{i-1} \ldots u_0)_b \times v_j b^j.$$
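
One way to gain confidence in this assertion is to execute the multiplication loop with the assertion checked at the point corresponding to the start of step M4. The Python sketch below is an illustrative rendering under stated assumptions (little-endian digit lists, no special handling of zero multiplier digits), and the name `multiply_m` is hypothetical.

```python
def multiply_m(u, v, b):
    """Classical multiplication of digit lists u (m digits) and v (n digits),
    little-endian, with the inductive assertion checked before each inner step."""
    m, n = len(u), len(v)

    def value(digits):
        return sum(x * b**i for i, x in enumerate(digits))

    w = [0] * (m + n)
    for j in range(n):
        k = 0
        for i in range(m):
            # Assertion that should hold at the beginning of step M4.
            assert (value(w[:j + m]) + k * b**(i + j)
                    == value(u) * value(v[:j]) + value(u[:i]) * v[j] * b**j)
            t = u[i] * v[j] + w[i + j] + k
            w[i + j], k = t % b, t // b
        w[j + m] = k
    return w

if __name__ == "__main__":
    print(multiply_m([8, 8, 5], [7, 9], 10))   # 588 * 97
```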


15. The error is nonnegative and less than $(n-2)b^{-n-1}$. [Similarly, if we ignore the 
products with $i + j > n + 3$, the error is bounded by $(n-3)b^{-n-2}$, etc.; but, in some 
cases, we must compute all of the products if we want to get the true rounded result. 
Further analysis shows that correctly rounded results of multiprecision floating point 
fractions can almost always be obtained by doing only about half the work needed to 
compute the full double-length product; moreover, a simple test will identify the rare 
cases for which full precision is needed. See W. Krandick and J. R. Johnson, Proc. 
IEEE Symp. Computer Arithmetic 11 (1993), 228-233.] 
16. Q1. Set $r \leftarrow 0$, $j \leftarrow n - 1$. 

Q2. Set $w_j \leftarrow \lfloor(rb + u_j)/v\rfloor$, $r \leftarrow (rb + u_j) \bmod v$. 

Q3. Decrease j by 1, and return to Q2 if $j \ge 0$. 
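
Steps Q1 through Q3 translate almost line for line into code. The following Python sketch assumes a little-endian digit list for u and a single-digit divisor with 0 < v < b; the name `short_divide` is only for illustration.

```python
def short_divide(u, v, b):
    """Divide the radix-b number u (little-endian digits) by the single digit v.
    Returns (quotient digits, remainder)."""
    w = [0] * len(u)
    r = 0                                   # Q1
    for j in range(len(u) - 1, -1, -1):     # Q3 drives j downward
        t = r * b + u[j]                    # Q2
        w[j], r = t // v, t % v
    return w, r

if __name__ == "__main__":
    print(short_divide([0, 0, 1, 4], 7, 10))   # 4100 divided by 7
```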
17. $u/v > u_n b^n/((v_{n-1}+1)b^{n-1}) = b(1 - 1/(v_{n-1}+1)) > b(1 - 1/(b/2)) = b - 2$. 
18. $(u_n b + u_{n-1})/(v_{n-1}+1) \le u/((v_{n-1}+1)b^{n-1}) < u/v$. 
19. $u - \hat qv \le u - \hat qv_{n-1}b^{n-1} - \hat qv_{n-2}b^{n-2} = u_{n-2}b^{n-2} + \cdots + u_0 + \hat rb^{n-1} - \hat qv_{n-2}b^{n-2} < 
b^{n-2}(u_{n-2} + 1 + \hat rb - \hat qv_{n-2}) \le 0$. Since $u - \hat qv < 0$, $q < \hat q$. 
20. If $q \le \hat q - 2$, then $u < (\hat q - 1)v < \hat q(v_{n-1}b^{n-1} + (v_{n-2}+1)b^{n-2}) - v < \hat qv_{n-1}b^{n-1} + 
\hat qv_{n-2}b^{n-2} + b^{n-1} - v \le \hat qv_{n-1}b^{n-1} + (b\hat r + u_{n-2})b^{n-2} + b^{n-1} - v = u_nb^n + u_{n-1}b^{n-1} + 
u_{n-2}b^{n-2} + b^{n-1} - v \le u_nb^n + u_{n-1}b^{n-1} + u_{n-2}b^{n-2} \le u$. In other words, $u < u$, and 
this is a contradiction. 
21. (Solution by G. K. Goyal.) The inequality $\hat qv_{n-2} \le b\hat r + u_{n-2}$ implies that 
we have $\hat q \le (u_nb^2 + u_{n-1}b + u_{n-2})/(v_{n-1}b + v_{n-2}) \le u/((v_{n-1}b + v_{n-2})b^{n-2})$. 
Now $u \bmod v = u - qv = v(1 - \alpha)$ where $0 < \alpha = 1 + q - u/v \le \hat q - u/v \le 
u(1/((v_{n-1}b + v_{n-2})b^{n-2}) - 1/v) = u(v_{n-3}b^{n-3} + \cdots)/((v_{n-1}b + v_{n-2})b^{n-2}v) < 
u/(v_{n-1}bv) \le \hat q/(v_{n-1}b) \le (b-1)/(v_{n-1}b)$, and this is at most $2/b$ since $v_{n-1} \ge \frac12(b-1)$. 
22. Let u = 4100, v = 588. We first try $\hat q = \lfloor 41/5\rfloor = 8$, but $8\cdot 8 > 10(41 - 40) + 0$. Then 
we set $\hat q = 7$, and now we find $7\cdot 8 < 10(41 - 35) + 0$. But 7 times 588 equals 4116, so 
the true quotient is q = 6. (Incidentally, this example shows that Theorem B cannot 
be improved under the given hypotheses, when b = 10. Similarly, when $b = 2^{16}$ we can 
let $u = (\mathtt{7fff800100000000})_{16}$, $v = (\mathtt{800080020005})_{16}$.) 
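
Both examples can be replayed mechanically. The Python sketch below assumes the usual trial-quotient test of step D3 (estimate from the two leading digits of u and the leading digit of v, then correct using the next digit of each); the function name `trial_quotient` and its argument layout are choices made for this illustration.

```python
def trial_quotient(u_top, v_top, b):
    """q-hat computed from the three leading digits of u and the two leading
    digits of v, each given most significant digit first."""
    u2, u1, u0 = u_top
    v1, v0 = v_top
    q_hat, r_hat = divmod(u2 * b + u1, v1)
    while q_hat >= b or q_hat * v0 > b * r_hat + u0:
        q_hat -= 1
        r_hat += v1
        if r_hat >= b:          # the test is repeated only while r-hat < b
            break
    return q_hat

if __name__ == "__main__":
    print(trial_quotient((4, 1, 0), (5, 8), 10), 4100 // 588)
    big_u, big_v = 0x7FFF800100000000, 0x800080020005
    print(trial_quotient((0x7FFF, 0x8001, 0x0000), (0x8000, 0x8002), 1 << 16),
          big_u // big_v)
```

In both cases the estimate exceeds the true quotient by exactly one, which is the situation the answer describes.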
23. Obviously $v\lfloor b/(v+1)\rfloor < (v+1)\lfloor b/(v+1)\rfloor \le b$; and the lower bound certainly 
holds if $v \ge b/2$. Otherwise $v\lfloor b/(v+1)\rfloor \ge v(b-v)/(v+1) \ge (b-1)/2 > \lfloor b/2\rfloor - 1$. 
24. The approximate probability is only $\log_b 2$, not $\frac12$. (For example, if $b = 2^{32}$, the 
probability that $v_{n-1} \ge 2^{31}$ is approximately $\frac1{32}$; this is still high enough to warrant 
the special test for d = 1 in steps D1 and D8.) 
25. 002        ENTA 1           1 
    003        ADD  V+N-1       1 
    004        STA  TEMP        1 
    005        ENTA 1           1 
    006        JOV  1F          1          Jump if v_{n−1} = b − 1. 
    007        ENTX 0           1 
    008        DIV  TEMP        1          Otherwise compute ⌊b/(v_{n−1} + 1)⌋. 
    009        JOV  DIVBYZERO   1          Jump if v_{n−1} = 0. 
    010   1H   STA  D           1 
    011        DECA 1           1 
    012        JANZ *+3         1          Jump if d ≠ 1. 
    013        STZ  U+M+N       1 − A      Set u_{m+n} ← 0. 
    014        JMP  D2          1 − A 
    015        ENN1 N           A          Multiply v by d. 
    016        ENTX 0           A 
    017   2H   STX  CARRY       AN 
    018        LDA  V+N,1       AN 
    019        MUL  D           AN 
    ...        (as in exercise 13) 
    026        J1N  2B          AN 
    027        ENN1 M+N         A          (Now rX = 0.) 
    028   2H   STX  CARRY       A(M + N)   Multiply u by d. 
    029        LDA  U+M+N,1     A(M + N) 
    ...        (as in exercise 13) 
    037        J1N  2B          A(M + N) 
    038        STX  U+M+N       A 
26. (See the algorithm of exercise 16.) 
    101   D8   LDA  D           1     (Remainder will be left in 
    102        DECA 1           1       locations U through U+N-1) 
    103        JAZ  DONE        1     Terminate if d = 1. 
    104        ENT1 N-1         A     rI1 ≡ j; j ← n − 1. 
    105        ENTA 0           A     r ← 0. 
    106   1H   LDX  U,1         AN    rAX ← rb + u_j. 
    107        DIV  D           AN 
    108        STA  U,1         AN 
    109        SLAX 5           AN    (u_j, r) ← (⌊rAX/d⌋, rAX mod d). 
    110        DEC1 1           AN    j ← j − 1. 
    111        J1NN 1B          AN    Repeat for n > j ≥ 0. 


At this point, the division routine is complete; and by the next exercise, rAX = 0. 
27. It is du mod dv = d(u mod v). 
28. For convenience, let us assume that v has a decimal point at the left, i.e., v = 


(Un-Un—1Un—2---)o. After step N1 we have 4 <v<1+1/b: For 


b+1 v(b+1) v1 +1/b) 4 
e| js Unat ~ +o 


Un—-—1 T 1 
and 


a b+1 S v(b+ 1 — Un—1) s 1 Un-1(b + 1-— vn-1) 
Un-1 T 1 Un—1 + 1 Tob Un—1 + 1 
The latter quantity takes its smallest value when $v_{n-1} = 1$, since it is a concave function 
and the other extreme value is greater. 
. x b(b+1) jv 
The formula in step N2 may be written v + z ailp so we see as above 
that v will never become > 1 + 1/b. Un—1 + 
The minimum value of v after one iteration of step N2 is > 


(ma s (a a = hy ttt) (Et) 
Un—-1 +1 b= Un—-1+1 b2 t b2 
-i4 t3 2 5 (t+ Ott) 
b Be BN t : 


if t = vn-1 +1. The minimum of this quantity occurs for t = b/2 + 1; a lower bound 
is 1 — 3/2b. Hence un-1 > b— 2, after one iteration of step N2. Finally, we have 
(1 — 3/2b)(1 + 1/b)? > 1, when b > 5, so at most two more iterations are needed. The 
assertion is easily verified when b < 5. 


29. True, since (ujin...Uj)y < v. 


30. In Algorithms A and S, such overlap is possible if the algorithms are rewritten 
slightly; for example, in Algorithm A we could rewrite step A2 thus: “Set $t \leftarrow u_j + v_j + k$, 
$w_j \leftarrow t \bmod b$, $k \leftarrow \lfloor t/b\rfloor$.” 

In Algorithm M, $v_j$ may be in the same location as $w_{j+n}$. In Algorithm D, it 
is most convenient (as in Program D, exercise 26) to let $r_{n-1} \ldots r_0$ be the same as 
$u_{n-1} \ldots u_0$; and we can also let $q_m \ldots q_0$ be the same as $u_{m+n} \ldots u_n$, provided that no 
alteration of $u_{j+n}$ is made in step D6. (Line 098 of Program D can safely be changed 
to ‘J1N 2B’, since $u_{j+n}$ isn’t used in the subsequent calculation.) 


31. Consider the situation of Fig. 6 with $u = (u_{j+n} \ldots u_{j+1}u_j)_3$ as in Algorithm D. 
If the leading nonzero digits of u and v have the same sign, set $r \leftarrow u - v$, $q \leftarrow 1$; 
otherwise set $r \leftarrow u + v$, $q \leftarrow -1$. Now if $|r| > |u|$, or if $|r| = |u|$ and the first nonzero 
digit of $u_{j-1} \ldots u_0$ has the same sign as the first nonzero digit of r, set $q \leftarrow 0$; otherwise 
set $u_{j+n} \ldots u_j$ equal to the digits of r. 


32. See M. Nadler, CACM 4 (1961), 192-193; Z. Pawlak and A. Wakulicz, Bull. de 
l’Acad. Polonaise des Sciences, Classe III, 5 (1957), 233-236 (see also pages 803-804); 
and exercise 4.1—15. 


34. See, for example, R. E. Maeder, The Mathematica Journal 6,2 (Spring 1996), 
32-40; 6,3 (Summer 1996), 37-43. 


36. Given $\phi$ with an accuracy of $\pm2^{-2n}$, we can successively compute $\phi^{-1}$, $\phi^{-2}$, ... 
by subtraction until $\phi^{-k} < 2^{-n}$; the accumulated error will not exceed $2^{1-n}$. Then 
we can use the series $\ln\phi = \ln((1 + \phi^{-3})/(1 - \phi^{-3})) = 2(\phi^{-3} + \frac13\phi^{-9} + \frac15\phi^{-15} + \cdots)$. 
[See William Schooling’s article in Napier Tercentenary Memorial, edited by C. G. 
Knott (London: Longmans, 1915), 337-344.] An even better procedure, suggested in 
1965 by J. W. Wrench, Jr., is to evaluate 

$$\ln\phi = \tfrac12\ln\bigl((1 + 5^{-1/2})/(1 - 5^{-1/2})\bigr) = (2\phi - 1)\bigl(5^{-1} + \tfrac13 5^{-2} + \tfrac15 5^{-3} + \cdots\bigr).$$
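
Both series are instances of $\ln((1+x)/(1-x)) = 2(x + x^3/3 + x^5/5 + \cdots)$, with $x = \phi^{-3}$ and $x = 5^{-1/2}$ respectively. A quick floating point comparison in Python (a sanity check in double precision only, not a multiple-precision method; the function names are made up here) might look like this:

```python
import math

PHI = (1 + math.sqrt(5)) / 2

def ln_phi_cubed_series(terms=20):
    """2 (x + x^3/3 + x^5/5 + ...) with x = phi^-3."""
    x = PHI ** -3
    return 2 * sum(x ** (2 * i + 1) / (2 * i + 1) for i in range(terms))

def ln_phi_sqrt5_series(terms=30):
    """(2 phi - 1)(5^-1 + 5^-2/3 + 5^-3/5 + ...)."""
    return (2 * PHI - 1) * sum(5.0 ** -(i + 1) / (2 * i + 1) for i in range(terms))

if __name__ == "__main__":
    print(ln_phi_cubed_series(), ln_phi_sqrt5_series(), math.log(PHI))
```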


37. Let $d = 2^e$ so that $b > dv_{n-1} \ge b/2$. Instead of normalizing u and v in step D1, 
simply compute the two leading digits $v'v''$ of $2^e(v_{n-1}v_{n-2}v_{n-3})_b$ by shifting left 
e bits. In step D3, use $(v', v'')$ instead of $(v_{n-1}, v_{n-2})$ and $(u', u'', u''')$ instead of 
$(u_{j+n}, u_{j+n-1}, u_{j+n-2})$, where the digits $u'u''u'''$ are obtained from $(u_{j+n} \ldots u_{j+n-3})_b$ 
by shifting left e bits. Omit division by d in step D8. (In essence, u and v are being 
“virtually” shifted. This method saves computation when m is small compared to n.) 
38. Set $k \leftarrow n$, $r \leftarrow 0$, $s \leftarrow 1$, $t \leftarrow 0$, $w \leftarrow u$; we will preserve the invariant relation 
$uv = 2^{2k}(r + s^2 - s) + 2^{2k-n}t + 2^{2k-2n}vw$ with $0 \le t, w < 2^n$, and with $0 < r \le 2s$ 
unless $(r, s) = (0, 1)$. While $k > 0$, let $4w = 2^nw' + w''$ and $4t + w'v = 2^nt' + t''$, where 
$0 \le w'', t'' < 2^n$ and $0 \le t' \le 6$; then set $t \leftarrow t''$, $w \leftarrow w''$, $s \leftarrow 2s$, $r \leftarrow 4r + t' - s$, 
$k \leftarrow k - 1$. If $r \le 0$, set $s \leftarrow s - 1$ and $r \leftarrow r + 2s$; otherwise, if $r > 2s$, set $r \leftarrow r - 2s$ 
and $s \leftarrow s + 1$ (this correction might need to be done twice). Repeat until k = 0. Then 
$uv = r + s^2 - s$, since w is always a multiple of $2^{2n-2k}$. Consequently r = 0 if and only 
if uv = 0; otherwise the answer is s, because $uv - s \le s^2 < uv + s$. 
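
The procedure just described can be simulated directly. The Python sketch below follows the stated updates in order, assuming u and v are n-bit nonnegative integers held in ordinary Python integers; the function name `sqrt_of_product` is hypothetical, and the loop form of the "might need to be done twice" correction is a convenience of this rendering.

```python
def sqrt_of_product(u, v, n):
    """Return (r, s) with u*v == r + s*s - s, for 0 <= u, v < 2**n."""
    mask = (1 << n) - 1
    k, r, s, t, w = n, 0, 1, 0, u
    while k > 0:
        four_w = 4 * w
        w_hi, w = four_w >> n, four_w & mask      # 4w = 2^n w' + w''
        acc = 4 * t + w_hi * v
        t_hi, t = acc >> n, acc & mask            # 4t + w'v = 2^n t' + t''
        s = 2 * s
        r = 4 * r + t_hi - s
        k -= 1
        if r <= 0:
            s -= 1
            r += 2 * s
        else:
            while r > 2 * s:                      # at most two corrections expected
                r -= 2 * s
                s += 1
    return r, s

if __name__ == "__main__":
    print(sqrt_of_product(13, 11, 4))   # 143 lies between 12*12 - 12 and 12*12 + 12
```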


39. Let $S_j = \sum_{k\ge0} 16^{-k}/(8k + j)$. We want to know whether or not $2^{n-1}\pi \bmod 1 < \frac12$. 
Since $\pi = 4S_1 - 2S_4 - S_5 - S_6$, it suffices to have good estimates of $2^{n-1}S_j \bmod 1$. Now 
$2^{n-1}S_j$ is congruent (modulo 1) to $\sum_{0\le k<n/4} a_{njk}/(8k + j) + \sum_{k\ge n/4} 2^{n-1-4k}/(8k + j)$, 
where $a_{njk} = 2^{n-1-4k} \bmod (8k + j)$. Each term in the first sum can be approximated 
within $2^{-m}$ by computing $a_{njk}$ in $O(\log n)$ operations (Section 4.6.3) and then finding 
the scaled quotient $\lfloor 2^m a_{njk}/(8k + j)\rfloor$. The second sum can be approximated within 
$2^{-m}$ by computing $2^m$ times its first m/4 terms. If $m \approx 2\lg n$, the range of uncertainty 
will be $\approx 1/n$, and this will almost always be accurate enough. [Math. Comp. 66 
(1997), 903-913.] 
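
A small Python sketch of this bit-extraction idea follows. It is an illustration under simplifying assumptions: ordinary double-precision floats stand in for the m-bit scaled quotients, so it is trustworthy only for modest n, and the function names are invented for the example. Python's three-argument `pow` supplies the modular power $a_{njk}$.

```python
def frac_scaled_s(n, j):
    """Approximate (2^(n-1) * S_j) mod 1, where S_j = sum 16^-k / (8k + j)."""
    total = 0.0
    k = 0
    while n - 1 - 4 * k >= 0:                 # terms with k < n/4: reduce modulo 8k + j
        d = 8 * k + j
        total = (total + pow(2, n - 1 - 4 * k, d) / d) % 1.0
        k += 1
    for i in range(k, k + 16):                # a few tail terms with negative exponent
        total += 2.0 ** (n - 1 - 4 * i) / (8 * i + j)
    return total % 1.0

def pi_bit(n):
    """n-th bit of pi after the binary point (n >= 1), floating point sketch."""
    x = (4 * frac_scaled_s(n, 1) - 2 * frac_scaled_s(n, 4)
         - frac_scaled_s(n, 5) - frac_scaled_s(n, 6)) % 1.0
    return 0 if x < 0.5 else 1

if __name__ == "__main__":
    print("".join(str(pi_bit(n)) for n in range(1, 33)))
```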

Notes: Let Ç = e™/+ = (1+ i)/V2 be an 8th root of unity, and consider the 
values l; = In(1 — ¢7//2). Then lo = In(1 —1/V2), h = l7 = ln į — iarctan1, 
l2 = le = į ln $ — iarctan(1/V2), l3 = ls = 4 ln 2 — iarctan(1/3), la = In(1+ 1/V2). 
Also —S;/27/? = Ello +h +---4 Clr) for 1 < j < 8 by 1.2.9-(13). Therefore 
4S1 — 254 — Ss — S6 = 2lo — (2 — 2i)2l1 +2l4 + (2+ 2i)l7 = m. Other identities of interest 
are: 


In2 = S2 + 4S4 + 456 + 4S8; 
ln 3 = 262 + i S6; 
Ind = 2S2 + 294 + $ S6; 
V2 In(V2 +1) = S1 + $93 + $955 + $57; 
V2 arctan(1/V2) = S1 — 493 + 455 tSr; 
arctan(1/3) = Sı — S2 — 494 — 4 95; 
0 = 8S1 — 8S2 — 453 — 8S4 — 255 — 256 + Sv. 


In general we have 


z8k+1 z8k+5 
=A+B4 D, =A-B+4 D, 
> Sk+1 G > 8k +5 Á 
k>0 k>0 
z8k+3 z8k+7 
= A-B D =A+B D, 
8k+3 Ç , 8k+7 a 
k>0 k>0 
where 
1, 1+z 1, 1+vV2z+2? 
Acin = zz i 
8  1-z 2 1—V2z24 22 
1 
Ge Mare l V2z | 
4 arctan z, D 3572 arctan T =) : 
and 
gmkta 1 A 
mkta za z) + (—1) [meven] In(1 +2) + fam(2)), 
k>0 
L(m—1)/2] 
Qrka 2rk 2 

D = In( 1-2 t 
fam (2) >, (cos a n( 2 COS a 2 ) 


. 2rka 
2 sin arctan 
m 


zsin(2rk/m) ) 
1 — zcos(2rk/m) j ` 


40. To get the most significant n/2 places, we need about D x in? basic operations 


(see exercise 15). And we can get the least significant n/2 places by using a b-adic 
method when b is a power of 2 (see exercise 4.1—31): The problem is easily reduced to the 
case where v is odd. Let u = (...u2uiuo)s, V = ( . . . U2V1v0)b, and w = (...w2wiwo)e, 
where we want to solve u = vw (modulo per?) Compute v’ such that v'v modb = 1 
(see exercise 4.5.2-17). Then wo = v’uo mod b, and we can compute u’ = u — wov, 
wi = v'ug modb, etc. The rightmost n/2 places are found after about in? basic 
operations. So the total is jn? + O(n), while Algorithm D needs about n? + O(n). A 
pure right-to-left method for all n digits would require 4n? + O(n). [See A. Schönhage 
and E. Vetter, Lecture Notes in Comp. Sci. 855 (1994), 448-459; W. Krandick and 
T. Jebelean, J. Symbolic Computation 21 (1996), 441—455.] 


41. (a) If m = 0, let v = u. Otherwise subtract xw from $(u_{m+n-1} \ldots u_1u_0)_b$, where 
$x = u_0w' \bmod b$; this zeroes out the units digit, so we have effectively reduced m 
by 1. (This operation is closely related to the computation of u/w in b-adic arithmetic, 
since $u/w = q + b^mv/w$ for some integer q; see exercise 4.1-31. It wins over ordinary 
division because we never have to correct a trial divisor. To compute $w'$ when b is a 
power of 2, notice that if $w_0w' \equiv 1$ (modulo $2^e$) then $w_0w'' \equiv 1$ (modulo $2^{2e}$) when 
$w'' = (2 - w_0w')w'$, by the 2-adic analog of “Newton’s method.”) 

(b) Apply (a) to the product uv. Memory space is conserved if we interlace 
multiplication and modulation as follows: Set $k \leftarrow 0$, $t \leftarrow 0$. Then while $k < n$, 
preserve the invariant relation $b^kt \equiv (u_{k-1} \ldots u_0)_b\,v$ (modulo w) by setting $t \leftarrow t + u_kv$, 
$t \leftarrow (t - xw)/b$, $k \leftarrow k + 1$, where $x = t_0w' \bmod b$ is chosen to make $t - xw$ a multiple of b. 
This solution assumes that t, u, and v have a signed magnitude representation; we can 
work also with nonnegative numbers < 2w or with complement notations, as discussed 
by Shand and Vuillemin and by Kornerup, [IEEE Symp. Computer Arithmetic 11 
(1993), 252-259, 277-283]. If n is large, the techniques of Section 4.3.3 speed up the 
multiplication. 

(c) Represent all numbers congruent to u (modulo w) by an internal value $r(u)$ 
where $r(u) \equiv b^nu$. Then addition and subtraction are handled as usual, while 
multiplication is $r(uv) = \mathrm{bmult}(r(u), r(v))$, where bmult is the operation of (b). At the 
beginning of the computation, replace each operand u by $r(u) = \mathrm{bmult}(u, a)$, using 
the precomputed constant $a = b^{2n} \bmod w$. At the end, replace each $r(u)$ by $u = 
\mathrm{bmult}(r(u), 1)$. [In the application to RSA encryption, Section 4.5.4, we could redefine 
the coding scheme so that precomputation and postcomputation are unnecessary.] 
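
Parts (a) through (c) can be exercised with a compact Python sketch. It assumes $b = 2^e$, an odd modulus $w < b^n$, and operands already reduced below w; the names `newton_inverse`, `bmult`, and `modmul` are choices made for this illustration, and the final conditional addition of w is one simple way to keep results nonnegative.

```python
def newton_inverse(w0, e):
    """w' with w0 * w' == 1 (modulo 2^e), for odd w0, by doubling the precision."""
    inv, bits = 1, 1                       # w0 * 1 == 1 (modulo 2)
    while bits < e:
        bits *= 2
        inv = ((2 - w0 * inv) * inv) % (1 << bits)
    return inv % (1 << e)

def bmult(u, v, w, n, e, w_inv):
    """Return t with t == u * v * b^(-n) (modulo w) and 0 <= t < w,
    where b = 2^e, 0 <= u < b^n and 0 <= v < w."""
    b = 1 << e
    t = 0
    for k in range(n):
        u_k = (u >> (e * k)) & (b - 1)     # k-th radix-b digit of u
        t += u_k * v
        x = ((t % b) * w_inv) % b          # makes t - x*w a multiple of b
        t = (t - x * w) // b
    return t + w if t < 0 else t           # here -w < t < w

def modmul(u, v, w, n, e):
    """u * v mod w via the internal representation r(u) == b^n u."""
    b = 1 << e
    w_inv = newton_inverse(w % b, e)
    a = pow(b, 2 * n, w)                   # precomputed constant b^(2n) mod w
    ru = bmult(u, a, w, n, e, w_inv)
    rv = bmult(v, a, w, n, e, w_inv)
    return bmult(bmult(ru, rv, w, n, e, w_inv), 1, w, n, e, w_inv)

if __name__ == "__main__":
    print(modmul(123456, 654321, 1000003, 3, 8), (123456 * 654321) % 1000003)
```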


42. An interesting analysis by J. M. Holte in AMM 104 (1997), 138-149, establishes 
the exact formula 


made m J], -jn /m+1 pri 
Pak = le Js a +17. 
The inner sum is yey ik +1—r)™ = (7) when j = 0. (Exercise 5.1.3-25 


explains why Eulerian numbers arise in this connection.) 

43. By exercise 1.2.4-35 we have $w = \lfloor W/2^{16}\rfloor$, where $W = (2^8+1)t = (2^8+1)(uv + 2^7)$. 
Therefore if $uv/255 > c + \frac12$, we have $c < 2^8$, hence $w \ge \lfloor(2^{16}(c+1) + 2^8 - c)/2^{16}\rfloor \ge c + 1$; 
if $uv/255 < c + \frac12$, we have $w \le \lfloor(2^{16}(c+1) - c - 1)/2^{16}\rfloor = c$. [See J. F. Blinn, IEEE 
Computer Graphics and Applic. 14,6 (November 1994), 78-82.] 
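
Since u and v are single bytes, the claim can also be confirmed exhaustively. The Python sketch below assumes the computation has the form $t = uv + 2^7$, $w = \lfloor(t + \lfloor t/2^8\rfloor)/2^8\rfloor$, which is what the identity $W = (2^8+1)t$ corresponds to; the function name is illustrative.

```python
def scaled_product(u, v):
    """Rounded value of u*v/255 for 0 <= u, v <= 255, using shifts and adds only."""
    t = u * v + 128
    return (t + (t >> 8)) >> 8

if __name__ == "__main__":
    print(scaled_product(255, 255), scaled_product(128, 128), scaled_product(1, 127))
```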


SECTION 4.3.2 


1. The solution is unique since $7\cdot11\cdot13 = 1001$. The constructive proof of Theorem C 
tells us that the answer is $((11\cdot13)^6 + 6\cdot(7\cdot13)^{10} + 5\cdot(7\cdot11)^{12}) \bmod 1001$. But this answer 
is perhaps not explicit enough! By (24) we have $v_1 = 1$, $v_2 = (6 - 1)\cdot8 \bmod 11 = 7$, 
$v_3 = ((5 - 1)\cdot2 - 7)\cdot6 \bmod 13 = 6$, so $u = 6\cdot7\cdot11 + 7\cdot7 + 1 = 512$. 
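
The mixed-radix computation used in this answer can be sketched in a few lines of Python. The sketch assumes pairwise relatively prime moduli and uses Python's built-in `pow(x, -1, m)` (available from Python 3.8) for the constants $c_{ij}$; the function name `mixed_radix_crt` is invented for the example.

```python
def mixed_radix_crt(residues, moduli):
    """Return (digits v_1..v_r, value u) with u == residues[j] (modulo moduli[j])."""
    v = []
    for j, (u_j, m_j) in enumerate(zip(residues, moduli)):
        t = u_j
        for i in range(j):
            c_ij = pow(moduli[i], -1, m_j)     # c_ij * m_i == 1 (modulo m_j)
            t = ((t - v[i]) * c_ij) % m_j
        v.append(t % m_j)
    u = 0
    for j in range(len(v) - 1, -1, -1):        # u = v_r m_{r-1}...m_1 + ... + v_2 m_1 + v_1
        u = u * moduli[j] + v[j]
    return v, u

if __name__ == "__main__":
    print(mixed_radix_crt([1, 6, 5], [7, 11, 13]))
```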


2. No. There is at most one such u; the additional condition $u_1 \equiv \cdots \equiv u_r$ 
(modulo 1) is necessary and sufficient, and it follows that such a generalization is not 
very interesting. 


3. $u \equiv u_i$ (modulo $m_i$) implies that $u \equiv u_i$ (modulo $\gcd(m_i, m_j)$), so the condition 
$u_i \equiv u_j$ (modulo $\gcd(m_i, m_j)$) must surely hold if there is a solution. Furthermore if 
$u \equiv v$ (modulo $m_j$) for all j, then $u - v$ is a multiple of $\mathrm{lcm}(m_1, \ldots, m_r) = m$; hence 
there is at most one solution. 

The proof can now be completed in a nonconstructive manner by counting the 
number of different r-tuples $(u_1, \ldots, u_r)$ satisfying the conditions $0 \le u_j < m_j$ and 
$u_i \equiv u_j$ (modulo $\gcd(m_i, m_j)$). If this number is m, there must be a solution since 
$(u \bmod m_1, \ldots, u \bmod m_r)$ takes on m distinct values as u goes from a to $a + m - 1$. 
Assume that $u_1, \ldots, u_{r-1}$ have been chosen satisfying the given conditions; we must 
now pick $u_r \equiv u_j$ (modulo $\gcd(m_j, m_r)$) for $1 \le j < r$, and by the generalized Chinese 
remainder theorem for $r - 1$ elements there are 

$$m_r/\mathrm{lcm}(\gcd(m_1, m_r), \ldots, \gcd(m_{r-1}, m_r)) = m_r/\gcd(\mathrm{lcm}(m_1, \ldots, m_{r-1}), m_r) 
= \mathrm{lcm}(m_1, \ldots, m_r)/\mathrm{lcm}(m_1, \ldots, m_{r-1})$$

ways to do this. [This proof is based on identities (10), (11), (12), and (14) of 
Section 4.5.2.] 

A constructive proof [A. S. Fraenkel, Proc. Amer. Math. Soc. 14 (1963), 790-791] 
generalizing (25) can be given as follows. Let $M_j = \mathrm{lcm}(m_1, \ldots, m_j)$; we wish to find 
$u = v_rM_{r-1} + \cdots + v_2M_1 + v_1$, where $0 \le v_j < M_j/M_{j-1}$. Assume that $v_1, \ldots, v_{j-1}$ 
have already been determined; then we must solve the congruence 

$$v_jM_{j-1} + v_{j-1}M_{j-2} + \cdots + v_1 \equiv u_j \pmod{m_j}.$$

Here $v_{j-1}M_{j-2} + \cdots + v_1 \equiv u_i \equiv u_j$ (modulo $\gcd(m_i, m_j)$) for $i < j$ by hypothesis, so 
$c = u_j - (v_{j-1}M_{j-2} + \cdots + v_1)$ is a multiple of 

$$\mathrm{lcm}(\gcd(m_1, m_j), \ldots, \gcd(m_{j-1}, m_j)) = \gcd(M_{j-1}, m_j) = d_j.$$

We therefore must solve $v_jM_{j-1} \equiv c$ (modulo $m_j$). By Euclid’s algorithm there is a 
number $c_j$ such that $c_jM_{j-1} \equiv d_j$ (modulo $m_j$); hence we may take 

$$v_j = (c_jc)/d_j \bmod (m_j/d_j).$$

Notice that, as in the nonconstructive proof, we have $m_j/d_j = M_j/M_{j-1}$. 
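
The constructive proof can be followed step by step in code. The Python sketch below is an illustration under the assumptions of this answer (residues compatible modulo the pairwise gcds); the function name `generalized_crt` is hypothetical, and `pow(x, -1, m)` from Python 3.8 or later plays the role of Euclid's algorithm.

```python
from math import gcd

def generalized_crt(residues, moduli):
    """Return (u, M) with u == residues[j] (modulo moduli[j]) for all j and
    0 <= u < M = lcm(moduli); raise ValueError if the residues are incompatible."""
    u, big_m = 0, 1            # u = v_{j-1} M_{j-2} + ... + v_1 and big_m = M_{j-1}
    for u_j, m_j in zip(residues, moduli):
        d_j = gcd(big_m, m_j)
        c = u_j - u
        if c % d_j != 0:
            raise ValueError("residues are not congruent modulo the gcd")
        step = m_j // d_j      # equals M_j / M_{j-1}
        # c_j with c_j * M_{j-1} == d_j (modulo m_j), found modulo m_j / d_j
        c_j = pow(big_m // d_j, -1, step) if step > 1 else 0
        v_j = (c_j * (c // d_j)) % step
        u += v_j * big_m
        big_m *= step
    return u, big_m

if __name__ == "__main__":
    print(generalized_crt([5, 3, 8], [6, 10, 15]))   # residues of 23
```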
4. (After $m_4 = 91 = 7\cdot13$, we have used up all products of two or more odd primes 
that can be less than 100, so $m_5$, ... must all be prime.) We find 

$m_7 = 79$, $m_8 = 73$, $m_9 = 71$, $m_{10} = 67$, $m_{11} = 61$, 
$m_{12} = 59$, $m_{13} = 53$, $m_{14} = 47$, $m_{15} = 43$, $m_{16} = 41$, 
$m_{17} = 37$, $m_{18} = 31$, $m_{19} = 29$, $m_{20} = 23$, $m_{21} = 17$, 

and then we are stuck ($m_{22} = 1$ does no good). 
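
The list can be regenerated by a greedy scan. The Python sketch below assumes the rule implied by this answer, namely that each modulus is the largest odd number below 100 that is relatively prime to all earlier choices; the function name `greedy_odd_moduli` is made up for the illustration.

```python
from math import gcd

def greedy_odd_moduli(limit):
    """Scan odd m < limit downward, keeping each m that is relatively prime
    to every modulus already chosen."""
    chosen = []
    for m in range(limit - 1, 2, -1):
        if m % 2 == 1 and all(gcd(m, c) == 1 for c in chosen):
            chosen.append(m)
    return chosen

if __name__ == "__main__":
    print(greedy_odd_moduli(100))
```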


5. (a) No. The obvious upper bound, 

$$3^4 5^2 7^2 11^1 \ldots = \prod_{\substack{p\ \mathrm{odd}\\ p\ \mathrm{prime}}} p^{\lfloor\log_p 100\rfloor},$$

is attained if we choose $m_1 = 3^4$, $m_2 = 5^2$, etc. (It is more difficult, however, to 
maximize $m_1 \ldots m_r$ when r is fixed, or to maximize $e_1 + \cdots + e_r$ with relatively prime $e_j$ 
as we would attempt to do when using moduli $2^{e_j} - 1$.) (b) Replacing 100 by 256 and 
allowing even moduli gives $2^8 3^5 5^3 \ldots 251^1 \approx 1.67\cdot10^{109}$. 

6. (a) If $e = f + kg$, then $2^e = 2^f(2^g)^k \equiv 2^f\cdot1^k$ (modulo $2^g - 1$). So if $2^e \equiv 2^f$ 
(modulo $2^g - 1$), we have $2^{e \bmod g} \equiv 2^{f \bmod g}$ (modulo $2^g - 1$); and since the latter 
quantities lie between zero and $2^g - 1$ we must have $e \bmod g = f \bmod g$. (b) By 
part (a), $(1 + 2^d + \cdots + 2^{(c-1)d})\cdot(2^e - 1) \equiv (1 + 2^d + \cdots + 2^{(c-1)d})\cdot(2^d - 1) = 2^{cd} - 1 \equiv 
2^{ce} - 1 \equiv 2^1 - 1 = 1$ (modulo $2^f - 1$). 


7. We have $v_jm_{j-1} \ldots m_1 \equiv u_j - (v_{j-1}m_{j-2} \ldots m_1 + \cdots + v_1)$ and $C_jm_{j-1} \ldots m_1 \equiv 1$ 
(modulo $m_j$) by (23), (25), and (26); see P. A. Pritchard, CACM 27 (1984), 57. 

This method of rewriting the formulas uses the same number of arithmetic 
operations and fewer constants; but the number of constants is fewer only if we order the 
moduli so that $m_1 < m_2 < \cdots < m_r$, otherwise we would need a table of $m_i \bmod m_j$. 
This ordering of the moduli might seem to require more computation than if we made 
$m_1$ the largest, $m_2$ the next largest, etc., since there are many more operations to be 
done modulo $m_r$ than modulo $m_1$; but since $v_j$ can be as large as $m_j - 1$, we are better 
off with $m_1 < m_2 < \cdots < m_r$ in (24) also. So this idea appears to be preferable to the 
formulas in the text, although Section 4.3.3B shows that the formulas in the text are 
advantageous when the moduli have the form (14). 


8. Modulo mj: mj-1...mivj = mj—1...mi(... ((uj—v1) e1j —v2) €2j —+ + 0j~1) X 
CGj—1)j = Mj—2...ma(... (uj — vi)ery — +++ — Uj 2)ej 2)j — Uj 1Mj-2---M ZI = 
Uj vi V2M1 tea Vj—-1ıMj-—2... Mı. 


9. $u_r \leftarrow ((\ldots(v_rm_{r-1} + v_{r-1})m_{r-2} + \cdots)m_1 + v_1) \bmod m_r$, ..., 
$u_2 \leftarrow (v_2m_1 + v_1) \bmod m_2$, $u_1 \leftarrow v_1 \bmod m_1$. 

(The computation should be done in this order, if we want to let $u_j$ and $v_j$ share the 
same memory locations, as they can in (24).) 


10. If we redefine the “mod” operator so that it produces residues in the symmetrical 
range, the basic formulas (2), (3), (4) for arithmetic and (24), (25) for conversion 
remain the same, and the number u in (25) lies in the desired range (10). (Here (25) is 
a balanced mixed-radix notation, generalizing balanced ternary notation.) The comparison
of two numbers may still be done from left to right, in the simple manner described
in the text. Furthermore, it is possible to retain the value $u_j$ in a single computer word,
if we have signed magnitude representation within the computer, even if $m_j$ is almost
twice the word size. But the arithmetic operations analogous to (11) and (12) are more
difficult, so it appears that this idea would result in slightly slower operation on most
computers.


11. Multiply by $\frac12(m + 1) = (\frac12(m_1 + 1), \ldots, \frac12(m_r + 1))$. Note that $2t \cdot \frac{m+1}{2} \equiv t$
(modulo $m$). In general if $v$ is relatively prime to $m$, then we can find (by Euclid’s
algorithm) a number $v' = (v'_1, \ldots, v'_r)$ such that $vv' \equiv 1$ (modulo $m$); and then if $u$
is known to be a multiple of $v$ we have $u/v = uv'$, where the latter is computed with
modular multiplication. When $v$ is not relatively prime to $m$, division is much harder.


12. Replace $m_j$ by $m$ in (11). [Another way to test for overflow, if $m$ is odd, is to
maintain extra bits $u_0 = u \bmod 2$ and $v_0 = v \bmod 2$. Then overflow has occurred if and
only if $u_0 + v_0 \not\equiv w_1 + \cdots + w_r$ (modulo 2), where $(w_1, \ldots, w_r)$ are the mixed-radix
digits corresponding to $u + v$.]


13. (a) $x^2 - x = (x - 1)x \equiv 0$ (modulo $10^n$) is equivalent to $(x - 1)x \equiv 0$ (modulo $p^n$)
for $p = 2$ and 5. Either $x$ or $x - 1$ must be a multiple of $p$, and then the other is relatively
prime to $p^n$; so either $x$ or $x - 1$ must be a multiple of $p^n$. If $x \bmod 2^n = x \bmod 5^n = 0$
or 1, we must have $x \bmod 10^n = 0$ or 1; hence automorphs have $x \bmod 2^n \ne x \bmod 5^n$.
(b) If $x = qp^n + r$, where $r = 0$ or 1, then $r \equiv r^2 \equiv r^3$, so $3x^2 - 2x^3 \equiv (6qp^n r + 3r) -
(6qp^n r + 2r) \equiv r$ (modulo $p^{2n}$). (c) Let $c'$ be $(3(cx)^2 - 2(cx)^3)/x^2 = 3c^2 - 2c^3 x$.

Note: Since the last $k$ digits of an $n$-digit automorph form a $k$-digit automorph,
it makes sense to speak of the two $\infty$-digit automorphs, $x$ and $1 - x$, which are 10-adic
numbers (see exercise 4.1-31). The set of 10-adic numbers is equivalent under modular
arithmetic to the set of ordered pairs $(u_1, u_2)$, where $u_1$ is a 2-adic number and $u_2$ is a
5-adic number.
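
Part (b) can be tried out numerically. The following Python sketch is only an illustration under stated assumptions: the function name `lift_automorph` and the choice of the seeds 5 and 6 (the two nontrivial one-digit automorphs) are hypothetical and not part of the original answer. It applies $x \mapsto 3x^2 - 2x^3$ modulo $10^{2n}$, which should double the number of valid digits each time.

```python
def lift_automorph(x, n, doublings):
    """Given an n-digit automorph x (x*x % 10**n == x), apply
    x -> 3x^2 - 2x^3 (mod 10^(2n)) repeatedly; each step doubles n."""
    for _ in range(doublings):
        n *= 2
        x = (3 * x * x - 2 * x ** 3) % 10 ** n
    return x, n


if __name__ == "__main__":
    for seed in (5, 6):  # assumed starting points: the one-digit automorphs
        x, n = lift_automorph(seed, 1, 3)
        # Print the n-digit automorph and confirm the defining property.
        print(str(x).zfill(n), x * x % 10 ** n == x)
```

The two results are complementary in the sense of the note above: their sum is congruent to 1 modulo $10^n$.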


14. Find the cyclic convolution $(z_0, z_1, \ldots, z_{n-1})$ of floating point approximations to
$(a_0 u_0, a_1 u_1, \ldots, a_{n-1} u_{n-1})$ and $(a_0 v_0, a_1 v_1, \ldots, a_{n-1} v_{n-1})$, where the constants $a_k =
2^{-(kq \bmod n)/n}$ have been precomputed. The identities $u = \sum_{k=0}^{n-1} u_k a_k 2^{kq/n}$ and $v =
\sum_{k=0}^{n-1} v_k a_k 2^{kq/n}$ now imply that $w = \sum_{k=0}^{n-1} t_k a_k 2^{kq/n}$ where $t_k \approx z_k / a_k$. If sufficient
accuracy has been maintained, each $t_k$ will be very close to an integer. The representation
of $w$ can readily be found from those integers. [R. Crandall and B. Fagin, Math.
Comp. 62 (1994), 305-324. For improved error bounds, and extensions to moduli of
the form $k \cdot 2^n \pm 1$, see Colin Percival, Math. Comp. 72 (2002), 387-395.]


SECTION 4.3.3 
1.  12 × 23:     34 × 41:     22 × 18:     1234 × 2341:
       02           12           02            0276
       02           12           02            0276
      −01          +03          +00           −0396
       06           04           16            1394
       06           04           16            1394
     0276         1394         0396         2888794


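2. $\sqrt{Q + \lfloor\sqrt{Q}\rfloor} \le \sqrt{Q + \sqrt{Q}} < \sqrt{Q + 2\sqrt{Q} + 1} = \sqrt{Q} + 1$, so $\lfloor\sqrt{Q + R}\rfloor \le \lfloor\sqrt{Q}\rfloor + 1$.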


3. The result is true when $k \le 2$, so assume that $k > 2$. Let $q_k = 2^{Q_k}$, $r_k = 2^{R_k}$, so
that $R_k = \lfloor\sqrt{Q_k}\rfloor$ and $Q_k = Q_{k-1} + R_{k-1}$. We must show that $1 + (R_k + 1)2^{R_k} \le 2^{Q_{k-1}}$;
this inequality isn’t close at all. One way is to observe that $1 + (R_k + 1)2^{R_k} \le 1 + 2^{2R_k}$
and $2R_k < Q_{k-1}$ when $k > 2$. (The fact that $2R_k < Q_{k-1}$ is readily proved by
induction since $R_{k+1} - R_k \le 1$ and $Q_k - Q_{k-1} \ge 2$.)




4. For $j = 1, \ldots, r$, calculate $U_e(j^2)$, $jU_o(j^2)$, $V_e(j^2)$, $jV_o(j^2)$; and by recursively
calling the multiplication algorithm, calculate

$W(j) = (U_e(j^2) + jU_o(j^2))(V_e(j^2) + jV_o(j^2))$,
$W(-j) = (U_e(j^2) - jU_o(j^2))(V_e(j^2) - jV_o(j^2))$.

Then we have $W_e(j^2) = (W(j) + W(-j))/2$, $W_o(j^2) = (W(j) - W(-j))/(2j)$. Also
calculate $W_e(0) = U(0)V(0)$. Now construct difference tables for $W_e$ and $W_o$, which
are polynomials whose respective degrees are $r$ and $r - 1$.

This method reduces the size of the numbers being handled, and reduces the 
number of additions and multiplications. Its only disadvantage is a longer program 
(since the control is somewhat more complex, and some of the calculations must be 
done with signed numbers). 

Another possibility would perhaps be to evaluate $W_e$ and $W_o$ at $1^2$, $2^2$, $4^2$, ...,
$(2^r)^2$; although the numbers involved are larger, the calculations are faster, since all
multiplications are replaced by shifting and all divisions are by binary numbers of the
form $2^j(2^k - 1)$. (Simple procedures are available for dividing by such numbers.)


5. Start the $q$ and $r$ sequences out with $q_0$ and $q_1$ large enough so that the inequality
in exercise 3 is valid. Then we will find in the formulas like those preceding Theorem B
that we have $\eta_1 \to 0$ and $\eta_2 = (1 + 1/(2r_k))\,2^{1 + \sqrt{2Q_k} - \sqrt{2Q_{k+1}}}\,(Q_k/Q_{k+1})$. The factor
$Q_k/Q_{k+1} \to 1$ as $k \to \infty$, so we can ignore it if we want to show that $\eta_2 < 1 - \epsilon$
for all large $k$. Now $\sqrt{2Q_{k+1}} = \sqrt{2Q_k + 2\lceil\sqrt{2Q_k}\,\rceil + 2} \ge \sqrt{(2Q_k + 2\sqrt{2Q_k} + 1) + 1} \ge
\sqrt{2Q_k} + 1 + 1/(3R_k)$. Hence $\eta_2 \le (1 + 1/(2r_k))\,2^{-1/(3R_k)}$, and $\lg \eta_2 < 0$ for large
enough $k$.

Note: Algorithm T can also be modified to define a sequence $q_0$, $q_1$, ... of a similar
type that is based on $n$, so that $n \approx q_k + q_{k+1}$ after step T1. This modification leads
to the estimate (21).


6. Any common divisor of $6q + d_1$ and $6q + d_2$ must also divide their difference $d_2 - d_1$.
The $\binom{6}{2}$ differences are 2, 3, 4, 6, 8, 1, 2, 4, 6, 1, 3, 5, 2, 4, 2, so we must only show
that at most one of the given numbers is divisible by each of the primes 2, 3, 5. Clearly
only $6q + 2$ is even, and only $6q + 3$ is a multiple of 3; and there is at most one multiple
of 5, since $q_k \not\equiv 3$ (modulo 5).

7. Let $p_{k-1} < n \le p_k$. We have $t_k \le 6t_{k-1} + ck3^k$ for some constant $c$; so $t_k/6^k \le
t_{k-1}/6^{k-1} + ck/2^k \le t_0 + c\sum_{j \ge 1} j/2^j = M$. Thus $t_k \le M \cdot 6^k = O(p_k^{\log_3 6})$.

8. False. To see the fallacy, try it with k = 2. 

9. $\tilde u_s = \hat u_{(qs) \bmod K}$. In particular, if $q = -1$ we get $\hat u_{(-r) \bmod K}$, which avoids data-
flipping when computing inverse transforms.


10. All (Sk-1,--+,$k—j,tk—j-1,---,to) can be written 


5 gieti 5 wu) ( 5 atua), 


E E St 0<p<K 0<q<K 


and this is pia upv S(p, q), where |S(p,q)| = 0 or 27. We have |S(p,q)| = 2? for 
exactly 2?*/2) values of p and q. 


11. An automaton cannot have $z_2 = 1$ until it has $c \ge 2$, and this occurs first for
$M_j$ at time $3j - 1$. It follows that $M_j$ cannot have $z_2 z_1 z_0 \ne 000$ until time $3(j - 1)$.
Furthermore, if $M_j$ has $z_0 \ne 0$ at time $t$, we cannot change this to $z_0 = 0$ without
affecting the output; but the output cannot be affected by this value of $z_0$ until at least
time $t + j - 1$, so we must have $t + j - 1 \le 2n$. Since the first argument we gave proves
that $3(j - 1) \le t$, we must have $4(j - 1) \le 2n$, that is, $j - 1 \le n/2$, i.e., $j \le \lfloor n/2\rfloor + 1$.
This is the best possible bound, since the inputs $u = v = 2^n - 1$ require the use of
$M_j$ for all $j \le \lfloor n/2\rfloor + 1$. (For example, Table 2 shows that $M_2$ is needed to multiply
two-bit numbers, at time 3.)


12. We can “sweep through” $K$ lists of MIX-like instructions, executing the first instruction
on each list, in $O(K + (N \log N)^2)$ steps as follows: (i) A radix list sort (Section
5.2.5) will group together all identical instructions, in time $O(K + N)$. (ii) Each set of $j$
identical instructions can be performed in $O(\log N)^2 + O(j)$ steps, and there are $O(N^2)$
sets. A bounded number of sweeps will finish all the lists. The remaining details are
straightforward; for example, arithmetic operations can be simulated by converting p 
and q to binary. [SICOMP 9 (1980), 490-508.] 


13. If it takes $T(n)$ steps to multiply $n$-bit numbers, we can accomplish $m$-bit times
$n$-bit multiplication by breaking the $n$-bit number into $\lceil n/m\rceil$ $m$-bit groups, using
$\lceil n/m\rceil T(m) + O(n + m)$ operations. The results cited in the text therefore give an
estimated running time of $O(n \log m \log\log m)$ on Turing machines, or $O(n \log m)$ on
machines with random access to words of bounded size, or $O(n)$ on pointer machines.


15. The best upper bound known is $O(n(\log n)^2 \log\log n)$, due to M. J. Fischer and
L. J. Stockmeyer [J. Comp. and Syst. Sci. 9 (1974), 317-331]; their construction works
on multitape Turing machines, and is $O(n \log n)$ on pointer machines. The best lower
bound known is of order $n \log n/\log\log n$, due to M. S. Paterson, M. J. Fischer, and
A. R. Meyer [SIAM/AMS Proceedings 7 (1974), 97-111]; this applies to multitape
Turing machines but not to pointer machines.


16. Let $2^n$ be the smallest power of 2 that exceeds $2K$. Set $a_t \leftarrow \omega^{-t^2/2} u_t$ and
$b_t \leftarrow \omega^{(2K-2-t)^2/2}$, where $u_t = 0$ for $t \ge K$. We want to evaluate the convolutions
$c_r = \sum_{j=0}^{r} a_j b_{r-j}$ for $r = 2K - 2 - s$, when $0 \le s < K$. The convolutions can be found
by using three fast Fourier transformations of order $2^n$, as in the text’s multiplication
procedure. [Note that this technique, sometimes called the “chirp transform,” works for
any complex number $\omega$, not necessarily a root of unity. See L. I. Bluestein, Northeast
Electronics Res. and Eng. Meeting Record 10 (1968), 218-219; D. H. Bailey and P. N.
Swarztrauber, SIAM Review 33 (1991), 389-404.]
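
The algebra behind the chirp transform can be checked with a small Python sketch. It is a sketch under assumptions, not the procedure of the text: it assumes the quantity wanted is $\hat u_s = \sum_t \omega^{st} u_t$, it uses the identity $st = ((s+t)^2 - s^2 - t^2)/2$, and it evaluates the convolution directly in quadratic time where the answer would use three fast Fourier transforms. The name `chirp_transform` is hypothetical.

```python
import cmath


def chirp_transform(u, w):
    """Return [sum_t w**(s*t) * u[t] for s in range(K)] via one convolution.

    h is a square root of w, so h**(t*t) stands for w**(t*t/2).
    The convolution is done naively here; an FFT would replace it.
    """
    K = len(u)
    h = cmath.sqrt(w)
    a = [h ** (-t * t) * u[t] for t in range(K)]           # a_t
    b = [h ** ((2 * K - 2 - t) ** 2) for t in range(2 * K - 1)]  # b_t
    result = []
    for s in range(K):
        r = 2 * K - 2 - s
        c_r = sum(a[j] * b[r - j] for j in range(K))        # a_j = 0 for j >= K
        result.append(h ** (-s * s) * c_r)                  # undo the s^2 factor
    return result


if __name__ == "__main__":
    w = 0.9 + 0.3j                      # deliberately not a root of unity
    u = [1, 2, 3, 4]
    print(chirp_transform(u, w))
```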


17. The quantity $D_n = K_{n+1} - K_n$ satisfies $D_1 = 2$, $D_{2n} = 2D_n$, and $D_{2n+1} =
D_n$; hence $D_n = 2^{e_1 - t + 2}$ when $n$ has the stated form. It follows that $K_n = 3^{e_1} +
\sum_{l=2}^{t} 3^{e_l} 2^{e_1 - e_l - l + 3}$, by induction on $n$.

Incidentally, $K_n$ is odd, and we can multiply an $n$-place integer by an $(n + 1)$-
place integer with $(K_n + K_{n+1})/2$ 1-place multiplications. The generating function
$K(z) = \sum_{n \ge 1} K_n z^n$ satisfies $zK(z) + z^2 = K(z^2)(z + 1)(z + 2)$; hence $K(-1) = 1$ and
$K(1) = \frac15$.
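
The closed form can be compared against the recurrence numerically. The sketch below assumes the recurrence $K_1 = 1$, $K_{2n} = 3K_n$, $K_{2n+1} = 2K_{n+1} + K_n$ (which is consistent with the relations for $D_n$ above) and writes $n = 2^{e_1} + \cdots + 2^{e_t}$ with $e_1 > \cdots > e_t \ge 0$; the function names are illustrative only.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def K(n):
    """1-place multiplication count, from the assumed recurrence."""
    if n == 1:
        return 1
    half = n // 2
    if n % 2 == 0:
        return 3 * K(half)
    return 2 * K(half + 1) + K(half)


def K_closed(n):
    """3^e1 + sum over l = 2..t of 3^(e_l) * 2^(e1 - e_l - l + 3)."""
    exps = [i for i in range(n.bit_length() - 1, -1, -1) if (n >> i) & 1]
    e1 = exps[0]
    tail = sum(3 ** e * 2 ** (e1 - e - l + 3)
               for l, e in enumerate(exps[1:], start=2))
    return 3 ** e1 + tail


if __name__ == "__main__":
    print([K(n) for n in range(1, 12)])
    print(all(K(n) == K_closed(n) for n in range(1, 1000)))
```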
18. The following scheme uses $3N + S_N$ places of working storage, where $S_1 = 0$,
$S_{2n} = S_n$, and $S_{2n-1} = S_n + 1$, hence $S_n = e_1 - e_t - t + 2 - [t = 1]$ in the notation of
the previous exercise. Let $N = 2n - \epsilon$, where $\epsilon$ is 0 or 1, and assume that $N > 1$. Given
$N$-place numbers $u = 2^n U_1 + U_0$ and $v = 2^n V_1 + V_0$, we first form $|U_0 - U_1|$ and $|V_0 - V_1|$
in two $n$-place areas starting at positions 0 and $n$ of the $(3N + S_N)$-place working area.
Then we place their product into the working area starting at position $3n + S_n$. The
next step is to form the $2(n - \epsilon)$-place product $U_1 V_1$, starting in position 0; using
that product, we change the $3n - 2\epsilon$ places starting at position $3n + S_n$ to the value of
$U_1 V_1 - (U_0 - U_1)(V_0 - V_1) + 2^n U_1 V_1$. (Notice that $3n - 2\epsilon + 3n + S_n = 3N + S_N$.) Finally,
we form the $2n$-place product $U_0 V_0$ starting at position 0, and add it to the partial
result starting at positions $2n + S_n$ and $3n + S_n$. We must also move the $2N$-place
answer to its final position by shifting it down $2n + S_n$ positions.

The final move could be avoided by a trickier variation that cyclically rotates its 
output by a given amount within a designated working area. If the 2N-place product is 
not allowed to be adjacent to the auxiliary working space, we need about N more places 
of memory (that is, a total of about 6N instead of 5N places, for the input, output, and 
temporary storage); see R. Maeder, Lecture Notes in Comp. Sci. 722 (1993), 59-65. 


19. Let $m = s^2 + r$ where $-s < r \le s$. We can use (2) with $U_1 = \lfloor u/s\rfloor$, $U_0 = u \bmod s$,
$V_1 = \lfloor v/s\rfloor$, $V_0 = v \bmod s$, and with $s$ playing the role of $2^n$. If we know the signs
of $U_1 - U_0$ and $V_1 - V_0$ we know how to compute the product $|U_1 - U_0||V_1 - V_0|$,
which is $< m$, and whether to add or subtract it. It remains to multiply by $s$ and
by $s^2 \equiv -r$. Each of these can be done with four multiplication/divisions, using
exercise 3.2.1.1-9, but only seven are needed because one of the multiplications needed
to compute $sx \bmod m$ is by $r$ or $r + s$. Thus 14 multiplication/divisions are sufficient (or
12, in case $u = v$ or $u$ is constant). Without the ability to compare operands, we can
still do the job with one more multiplication, by computing $U_0 V_1$ and $U_1 V_0$ separately.


SECTION 4.4 


1. We compute $(\ldots(a_m b_{m-1} + a_{m-1}) b_{m-2} + \cdots + a_1) b_0 + a_0$ by adding and multiplying
in the $B_j$ system.

                   T. = 20(cwt. = 8(st. = 14(lb. = 16 oz.)))
Start with zero    0        0        0        0        0
Add 3              0        0        0        0        3
Multiply by 24     0        0        0        4        8
Add 9              0        0        0        5        1
Multiply by 60     0        2        5        9       12
Add 12             0        2        5       10        8
Multiply by 60     8        3        1        0        0
Add 37             8        3        1        2        5

(Addition and multiplication by a constant in a mixed-radix system are readily done 
using a simple generalization of the usual carry rule; see exercise 4.3.1—9.) 
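
The carry rule mentioned above can be sketched in Python. This is an illustrative sketch rather than the book's procedure: the names `normalize`, `mul_add`, and `convert`, and the list layout (most significant unit first), are assumptions made here. The accumulator is kept in the target system (tons, hundredweight, stones, pounds, ounces) while Horner's rule consumes the digits of the source system (days, hours, minutes, seconds).

```python
# Radix of each position below the leading one: 20 cwt per T., 8 st. per cwt.,
# 14 lb. per st., 16 oz. per lb.  The leading position (tons) is unbounded.
TARGET_RADICES = [20, 8, 14, 16]


def normalize(digits):
    """Propagate carries from right to left in the mixed-radix target system."""
    d = list(digits)
    for i in range(len(d) - 1, 0, -1):
        carry, d[i] = divmod(d[i], TARGET_RADICES[i - 1])
        d[i - 1] += carry
    return d


def mul_add(digits, factor, addend):
    """Multiply a mixed-radix number by a small constant, then add a constant."""
    d = [x * factor for x in digits]
    d[-1] += addend
    return normalize(d)


def convert(src_digits, src_radices):
    """Horner's rule: ((a3*b2 + a2)*b1 + a1)*b0 + a0, done in the target system."""
    acc = mul_add([0] * (len(TARGET_RADICES) + 1), 1, src_digits[0])
    for radix, digit in zip(src_radices, src_digits[1:]):
        acc = mul_add(acc, radix, digit)
    return acc


if __name__ == "__main__":
    # 3 d. 9 h. 12 m. 37 s., read as a number of ounces
    print(convert([3, 9, 12, 37], [24, 60, 60]))
```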


2. We compute $\lfloor u/B_0\rfloor$, $\lfloor\lfloor u/B_0\rfloor/B_1\rfloor$, etc., and the remainders are $A_0$, $A_1$, etc. The
division is done in the $b_j$ system.

                   d. = 24(h. = 60(m. = 60 s.))
Start with u       3        9       12       37
Divide by 16       0        5        4       32     Remainder = 5
Divide by 14       0        0       21       45     Remainder = 2
Divide by 8        0        0        2       43     Remainder = 1
Divide by 20       0        0        0        8     Remainder = 3
Divide by ∞        0        0        0        0     Remainder = 8

Answer: 8 T. 3 cwt. 1 st. 2 lb. 5 oz.


3. The following procedure due to G. L. Steele Jr. and Jon L White generalizes
Taranto’s algorithm for $B = 2$ originally published in CACM 2, 7 (July 1959), 27.

A1. [Initialize.] Set $M \leftarrow 0$, $U_0 \leftarrow 0$.

A2. [Done?] If $u < \epsilon$ or $u > 1 - \epsilon$, go to step A4. (Otherwise no $M$-place fraction
will satisfy the given conditions.)

A3. [Transform.] Set $M \leftarrow M + 1$, $U_{-M} \leftarrow \lfloor Bu\rfloor$, $u \leftarrow Bu \bmod 1$, $\epsilon \leftarrow B\epsilon$, and
return to A2. (This transformation returns us to essentially the same state
we were in before; the remaining problem is to convert $u$ to $U$ with fewest
radix-$B$ places so that $|U - u| < \epsilon$. Note, however, that $\epsilon$ may now be $\ge 1$;
in this case we could go immediately to step A4 instead of storing the new
value of $\epsilon$.)

A4. [Round.] If $u \ge \frac12$, increase $U_{-M}$ by 1. (If $u = \frac12$ exactly, another rounding
rule such as “increase $U_{-M}$ by 1 only when it is odd” might be preferred; see
Section 4.2.2.) ∎

Step A4 will never increase $U_{-M}$ from $B - 1$ to $B$; for if $U_{-M} = B - 1$ we must have
$M > 0$, but no $(M - 1)$-place fraction was sufficiently accurate. Steele and White go
on to consider floating point conversions in their paper [SIGPLAN Notices 25, 6 (June
1990), 112-126]. See also D. E. Knuth in Beauty is Our Business, edited by W. H. J.
Feijen et al. (New York: Springer, 1990), 233-242.
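
Steps A1 to A4 translate almost line by line into Python. The sketch below is an illustration under assumptions: it uses exact rational arithmetic from the standard `fractions` module so that no floating point error intrudes, it returns the digits as a list `[U_0, U_-1, ..., U_-M]`, and the name `shortest_fraction` is hypothetical.

```python
from fractions import Fraction


def shortest_fraction(u, eps, B=10):
    """Shortest radix-B fraction U with |U - u| < eps, for 0 <= u < 1, eps > 0.

    Returns [U_0, U_-1, ..., U_-M] following steps A1-A4.
    """
    u, eps = Fraction(u), Fraction(eps)
    digits = [0]                                   # A1: M = 0, U_0 = 0
    while not (u < eps or u > 1 - eps):            # A2: done?
        u *= B                                     # A3: transform
        d = u.numerator // u.denominator           #     U_-M = floor(B*u)
        digits.append(d)
        u -= d                                     #     u = B*u mod 1
        eps *= B
    if u >= Fraction(1, 2):                        # A4: round
        digits[-1] += 1
    return digits


if __name__ == "__main__":
    print(shortest_fraction(Fraction(1, 3), Fraction(1, 1000)))
    print(shortest_fraction(Fraction(7, 10), Fraction(1, 10), B=2))
```

With $u = \frac13$ and $\epsilon = 10^{-3}$ three decimal places are needed, because $.33$ differs from $\frac13$ by $\frac{1}{300}$.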

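4. (a) $1/2^k = 5^k/10^k$. (b) Every prime divisor of $b$ divides $B$.

5. If and only if $10^n - 1 \le c < w$; see (3).

7. $\alpha u \le ux \le \alpha u + u/w \le \alpha u + 1$, hence $\lfloor\alpha u\rfloor \le \lfloor ux\rfloor \le \lfloor\alpha u + 1\rfloor$. Furthermore, in
the special case cited we have $ux < \alpha u + \alpha$ and $\lfloor\alpha u\rfloor = \lfloor\alpha u + \alpha - \epsilon\rfloor$ for $0 < \epsilon \le \alpha$.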


8.      ENT1  0
        LDA   U
   1H   MUL   =1//10=
   3H   STA   TEMP
        MUL   =-10=
        SLAX  5
        ADD   U
        JANN  2F
        LDA   TEMP          (Can occur only on
        DECA  1              the first iteration,
        JMP   3B             by exercise 7.)
   2H   STA   ANSWER,1      (May be minus zero.)
        LDA   TEMP
        INC1  1
        JAP   1B   ∎
JANN 2F 


9. Let $p_k = 2^{2^{k+2}}$. By induction on $k$ we have $v_k(u) \le \frac{16}{5}(1 - 1/p_k)(\lfloor u/2\rfloor + 1)$;
hence $\lfloor v_k(u)/16\rfloor \le \lfloor\lfloor u/2\rfloor/5\rfloor = \lfloor u/10\rfloor$ for all integers $u \ge 0$. Furthermore, since
$v_k(u + 1) \ge v_k(u)$, the smallest counterexample to $\lfloor v_k(u)/16\rfloor = \lfloor u/10\rfloor$ must occur
when $u$ is a multiple of 10.

Now let $u = 10m$ be fixed, and suppose $v_k(u) \bmod p_k = r_k$ so that $v_{k+1}(u) =
v_k(u) + (v_k(u) - r_k)/p_k$. The fact that $p_k^2 = p_{k+1}$ implies that there exist integers $m_0$,
$m_1$, $m_2$, ... such that $m_0 = m$, $v_k(u) = (p_k - 1)m_k + x_k$, and $m_k = m_{k+1} p_k + x_k - r_k$,
where $x_{k+1} = (p_k + 1)x_k - p_k r_k$. Unwinding this recurrence yields

$v_k(u) = (p_k - 1)m_k + c_k - \sum_{j=0}^{k-1} p_j r_j \prod_{i=j+1}^{k-1} (p_i + 1)$,   $c_k = 3\,\frac{p_k - 1}{p_0 - 1}$.

Furthermore $v_k(u) + m_k = v_{k+1}(u) + m_{k+1}$ is independent of $k$, and it follows that
$v_k(u)/16 = m + (3 - m_k)/16$. So the minimal counterexample $u = 10y_k$ is obtained for
$0 \le k \le 4$ by setting $m_k = 4$ and $r_j = p_j - 1$ in the formula $y_k = \frac{1}{16}(v_k + m_k - c_0)$. In
hexadecimal notation, $y_k$ turns out to be the final $2^k$ digits of 434243414342434.
Since $v_4(10y_4)$ is less than $2^{64}$ the same counterexample is also minimal for all
$k > 4$. One way to work with larger operands is to modify the method by starting with
$v_0(u) = 6\lfloor u/2\rfloor + 6$ and letting $c_k = 6(p_k - 1)/(p_0 - 1)$, $m_0 = 2m$. (In effect, we are
truncating one bit further to the right than before.) Then $\lfloor v_k(u)/32\rfloor = \lfloor u/10\rfloor$ when
$u$ is less than $10z_k$, for $1 \le k \le 7$, where $z_k = \frac{1}{32}(v_k + m_k - 6)$ when $m_k = 7$, $r_0 = 14$,
and $r_j = p_j - 1$ for $j > 0$. For example, $z_4 =$ 1c342c3424342c34. [This exercise is
based on ideas of R. A. Vowels, Australian Comp. J. 24 (1992), 81-85.]
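
The smallest counterexamples for small $k$ can be found by direct search. The sketch below assumes the shift-and-add scheme $v_0(u) = 3\lfloor u/2\rfloor + 3$, $v_{k+1}(u) = v_k(u) + \lfloor v_k(u)/p_k\rfloor$ with $p_k = 2^{2^{k+2}}$, which is what the relations in this answer imply; the function names are illustrative.

```python
def v(u, k):
    """Shift-and-add approximation to 16*u/10 after k refinement steps."""
    x = 3 * (u >> 1) + 3
    for i in range(k):
        x += x >> (1 << (i + 2))      # divide by p_i = 2^(2^(i+2)) via a shift
    return x


def first_counterexample(k, limit):
    """Smallest u < limit with floor(v_k(u)/16) != floor(u/10), or None."""
    for u in range(limit):
        if v(u, k) >> 4 != u // 10:
            return u
    return None


if __name__ == "__main__":
    for k, limit in ((0, 100), (1, 1000), (2, 100000)):
        u = first_counterexample(k, limit)
        print(k, u, hex(u // 10))
```

The hexadecimal values printed for $u/10$ are suffixes of the digit string quoted above, of lengths 1, 2, and 4.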


10. (i) Shift right one; (ii) Extract the left bit of each group; (iii) Shift the result 
of (ii) right two; (iv) Shift the result of (iii) right one, and add it to the result of (iii); 
(v) Subtract the result of (iv) from the result of (i). 
11.   5.7 7 2 1
    − 1 0
      4 7.7 2 1
    −   9 4
      3 8 3.2 1
    −   7 6 6
      3 0 6 6.1
    −   6 1 3 2
      2 4 5 2 9     Answer: $(24529)_{10}$.


12. First convert the ternary number to nonary (radix 9) notation, then proceed as 
in octal-to-decimal conversion but without doubling. Decimal to nonary is similar. In 
the given example, we have 


       1.764723
    −  1
      16.64723                              9.87654
    − 16                                  + 9
     150.4723                             118.7654
    − 150                                 + 118
     1354.723                             1316.654
    − 1354                                + 1316
     12193.23                             14483.54
    − 12193                               + 14483
     109739.3                             160428.4
    − 109739                              + 160428
     987654    Answer: $(987654)_{10}$.      1764723    Answer: $(1764723)_9$.
13. BUF    ALF   .␣␣␣␣               (Radix point on first line)
           ORIG  *+39
    START  JOV   OFLO                Ensure that overflow is off.
           ENT2  -40                 Set buffer pointer.
    8H     ENT3  10                  Set loop counter.
    1H     ENT1  m                   Begin multiplication routine.
           ENTX  0
    2H     STX   CARRY
           ...                       (See exercise 4.3.1-13, with
           J1P   2B                   $v = 10^9$ and $W = U$.)
           SLAX  5                   rA ← next nine digits.
           CHAR
           STA   BUF+40,2(2:5)       Store next nine digits.
           STX   BUF+41,2
           INC2  2                   Increase buffer pointer.
           DEC3  1
           J3P   1B                  Repeat ten times.
           OUT   BUF+20,2(PRINTER)
           J2N   8B                  Repeat until both lines are printed. ∎


14. Let $K(n)$ be the number of steps required to convert an $n$-digit decimal number
to binary and at the same time to compute the binary representation of $10^n$. Then
we have $K(2n) \le 2K(n) + O(M(n))$. Proof. Given the number $U = (u_{2n-1} \ldots u_0)_{10}$,
compute $U_1 = (u_{2n-1} \ldots u_n)_{10}$ and $U_0 = (u_{n-1} \ldots u_0)_{10}$ and $10^n$, in $2K(n)$ steps, then
compute $U = 10^n U_1 + U_0$ and $10^{2n} = 10^n \cdot 10^n$ in $O(M(n))$ steps. It follows that
$K(2^n) = O(M(2^n) + 2M(2^{n-1}) + 4M(2^{n-2}) + \cdots) = O(nM(2^n))$.

[Similarly, Schönhage has observed that we can convert a $(2^n \lg 10)$-bit number $U$
from binary to decimal, in $O(nM(2^n))$ steps. First form $V = 10^{2^{n-1}}$ in $O(M(2^{n-1}) +
M(2^{n-2}) + \cdots) = O(M(2^n))$ steps, then compute $U_0 = (U \bmod V)$ and $U_1 = \lfloor U/V\rfloor$
in $O(M(2^n))$ further steps, then convert $U_0$ and $U_1$.]
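
Both divide-and-conquer conversions can be sketched briefly in Python. This is only an illustration of the recursive structure: Python's built-in integers stand in for the multiprecision arithmetic, the function names are hypothetical, and the second routine recomputes powers of ten with `**` instead of sharing them as the analysis above assumes.

```python
def decimal_to_binary(digits):
    """Return (value, 10**len(digits)) for a nonempty string of decimal digits.

    Each half is converted recursively, together with its power of ten;
    the halves are then combined with one multiplication each.
    """
    n = len(digits)
    if n == 1:
        return "0123456789".index(digits), 10
    low_len = n // 2
    hi, pow_hi = decimal_to_binary(digits[:n - low_len])
    lo, pow_lo = decimal_to_binary(digits[n - low_len:])
    return hi * pow_lo + lo, pow_hi * pow_lo


def binary_to_decimal(value, ndigits):
    """Return value as a string of exactly ndigits decimal digits.

    Splits by one division by V = 10**(ndigits // 2), then recurses.
    Assumes 0 <= value < 10**ndigits.
    """
    if ndigits == 1:
        return "0123456789"[value]
    low_len = ndigits // 2
    hi, lo = divmod(value, 10 ** low_len)
    return binary_to_decimal(hi, ndigits - low_len) + binary_to_decimal(lo, low_len)


if __name__ == "__main__":
    print(decimal_to_binary("987654"))
    print(binary_to_decimal(24529, 8))
```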

17. See W. D. Clinger, SIGPLAN Notices 25,6 (June 1990), 92-101, and the paper 
by Steele and White cited in the answer to exercise 3. 

18. Let $U = \mathrm{round}_B(u, P)$ and $v = \mathrm{round}_b(U, p)$. We may assume that $u > 0$, so that
$U > 0$ and $v > 0$. Case 1: $v < u$. Determine $e$ and $E$ such that $b^{e-1} < u \le b^e$,
$B^{E-1} \le U < B^E$. Then $u \le U + \frac12 B^{E-P}$ and $U \le u - \frac12 b^{e-p}$; hence $B^{P-1} \le
B^{P-E}U < B^{P-E}u \le b^{p-e}u \le b^p$. Case 2: $v > u$. Determine $e$ and $E$ such that
$b^{e-1} \le u < b^e$, $B^{E-1} < U \le B^E$. Then $u \ge U - \frac12 B^{E-P}$ and $U \ge u + \frac12 b^{e-p}$; hence
$B^{P-1} \le B^{P-E}(U - B^{E-P}) < B^{P-E}u \le b^{p-e}u < b^p$. Thus we have proved that
$B^{P-1} < b^p$ whenever $v \ne u$.

Conversely, if $B^{P-1} < b^p$, the proof above suggests that the most likely example
for which $u \ne v$ will occur when $u$ is a power of $b$ and at the same time it is close to a
power of $B$. We have $B^{P-1}b^p < B^{P-1}b^p + \frac12 b^p - \frac12 B^{P-1} - \frac14 = (B^{P-1} + \frac12)(b^p - \frac12)$;
hence $1 < \alpha = 1/(1 - \frac12 b^{-p}) < 1 + \frac12 B^{1-P} = \beta$. There are integers $e$ and $E$ such that
$\log_B \alpha < e \log_B b - E < \log_B \beta$, by exercise 4.5.3-50. Hence $\alpha < b^e/B^E < \beta$, for some $e$
and $E$. Now we have $\mathrm{round}_B(b^e, P) = B^E$, and $\mathrm{round}_b(B^E, p) < b^e$. [CACM 11 (1968),
47-50; Proc. Amer. Math. Soc. 19 (1968), 716-723.]

For example, if $b^p = 2^{10}$ and $B^P = 10^4$, the number $u = 2^{6408} \approx .100049 \cdot 10^{1930}$
rounds down to $U = .1 \cdot 10^{1930} \approx (.111111111101111111111)_2 \cdot 2^{6408}$, which rounds
down to $2^{6408} - 2^{6398}$. (The smallest example is actually $\mathrm{round}((.1111111001)_2 \cdot 2^{784}) =
.1011 \cdot 10^{237}$, $\mathrm{round}(.1011 \cdot 10^{237}) = (.1111111010)_2 \cdot 2^{784}$, found by Fred J. Tydeman.)
19. $m_1 = (\mathrm{F0F0F0F0})_{16}$, $c_1 = 1 - 10/16$ makes $U = ((u_7 u_6)_{10} \ldots (u_1 u_0)_{10})_{256}$; then
$m_2 = (\mathrm{FF00FF00})_{16}$, $c_2 = 1 - 10^2/16^2$ makes $U = ((u_7 u_6 u_5 u_4)_{10} (u_3 u_2 u_1 u_0)_{10})_{65536}$;
and $m_3 = (\mathrm{FFFF0000})_{16}$, $c_3 = 1 - 10^4/16^4$ finishes the job. [Compare with Schönhage’s
algorithm in exercise 14. This technique is due to Roy A. Keir, circa 1958.]
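
The three masking steps can be written out in Python for a 32-bit word holding eight binary-coded decimal digits. This sketch assumes each step has the form $U \leftarrow U - c_i (U \mathbin{\&} m_i)$; since the masked field already carries a factor of 16, 256, or 65536, multiplying by $c_i$ amounts to a shift followed by multiplication by $16 - 10$, $256 - 100$, or $65536 - 10000$. The function name is hypothetical.

```python
def bcd_to_binary(U):
    """Convert eight packed BCD digits (one per nibble) to a binary integer."""
    U -= ((U & 0xF0F0F0F0) >> 4) * (16 - 10)          # pairs of digits, radix 256
    U -= ((U & 0xFF00FF00) >> 8) * (256 - 100)        # groups of four, radix 65536
    U -= ((U & 0xFFFF0000) >> 16) * (65536 - 10000)   # all eight digits
    return U


if __name__ == "__main__":
    print(bcd_to_binary(0x12345678))
```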


SECTION 4.5.1 

1. Test whether or not uv’ < u’v, since the denominators are positive. (See also the 
answer to exercise 4.5.3-39.) 

2. If c > 1 divides both u/d and v/d, then cd divides both u and v. 

3. Let $p$ be prime. If $p^e$ is a divisor of $uv$ and $u'v'$ for $e \ge 1$, then either $p^e \backslash u$
and $p^e \backslash v'$ or $p^e \backslash u'$ and $p^e \backslash v$; hence $p^e \backslash \gcd(u, v') \gcd(u', v)$. The converse follows by
reversing the argument.

4. Let $d_1 = \gcd(u, v)$, $d_2 = \gcd(u', v')$; the answer is $w = (u/d_1)(v'/d_2)\,\mathrm{sign}(v)$,
$w' = |(u'/d_2)(v/d_1)|$, with a “divide by zero” error message if $v = 0$.

5. $d_1 = 10$, $t = 17 \cdot 7 - 27 \cdot 12 = -205$, $d_2 = 5$, $w = -41$, $w' = 168$.

6. Let $u'' = u'/d_1$, $v'' = v'/d_1$; our goal is to show that $\gcd(uv'' + u''v, d_1) =
\gcd(uv'' + u''v, d_1 u''v'')$. If $p$ is a prime that divides $u''$, then $p$ does not divide $u$ or $v''$,
so $p$ does not divide $uv'' + u''v$. A similar argument holds for prime divisors of $v''$, so
no prime divisors of $u''v''$ affect the given gcd.

7. $(N - 1)^2 + (N - 2)^2 = 2N^2 - (6N - 5)$. If the inputs are $n$-bit binary numbers,
$2n + 1$ bits may be necessary to represent $t$.


8. For multiplication and division these quantities obey the rules $x/0 = \mathrm{sign}(x)\infty$,
$(\pm\infty) \times x = x \times (\pm\infty) = (\pm\infty)/x = \pm\mathrm{sign}(x)\infty$, $x/(\pm\infty) = 0$, provided that $x$ is finite
and nonzero, without change to the algorithms described. Furthermore, the algorithms
can readily be modified so that $0/0 = 0 \times (\pm\infty) = (\pm\infty) \times 0 =$ “(0/0)”, where the latter
is a representation of “undefined.” If either operand is undefined the result should be
undefined also.

Since the multiplication and division subroutines can yield these fairly natural
rules of extended arithmetic, it is sometimes worthwhile to modify the addition and
subtraction operations so that they satisfy the rules $x \pm \infty = \pm\infty$, $x \pm (-\infty) = \mp\infty$,
for $x$ finite; $(\pm\infty) + (\pm\infty) = \pm\infty - (\mp\infty) = \pm\infty$; furthermore $(\pm\infty) + (\mp\infty) =
(\pm\infty) - (\pm\infty) = (0/0)$; and if either or both operands are (0/0), the result should also
be (0/0). Equality tests and comparisons may be treated in a similar manner.

The remarks above are independent of “overflow” indications. If $\infty$ is being used
to suggest overflow, it is incorrect to let $1/\infty$ be equal to zero, lest inaccurate results
be regarded as true answers. It is far better to represent overflow by (0/0), and to
adhere to the convention that the result of any operation is undefined if at least one of 
the inputs is undefined. This type of overflow indication has the advantage that final 
results of an extended calculation reveal exactly which answers are defined and which 
are not. 

9. If $u/u' \ne v/v'$, then $1 \le |uv' - u'v| = u'v'|u/u' - v/v'| < |2^{2n}u/u' - 2^{2n}v/v'|$; two
quantities differing by more than unity cannot have the same “floor.” (In other words,
the first $2n$ bits to the right of the binary point are enough to characterize the value
of a binary fraction, when there are $n$-bit denominators. We cannot improve this to
$2n - 1$ bits, for if $n = 4$ we have $\frac{1}{13} = (.00010011\ldots)_2$, $\frac{1}{14} = (.00010010\ldots)_2$.)


— 


— 


9 


11. To divide by $(v + v'\sqrt{5})/v''$, when $v$ and $v'$ are not both zero, multiply by the
reciprocal, $(v - v'\sqrt{5})v''/(v^2 - 5v'^2)$, and reduce to lowest terms.


12. ((297t — 1)/1); round(a) = (0/1) if and only if |z| < 2'~*. Similarly, round(«) = 
(1/0) if and only if x > 297+. 


13. One idea is to limit numerator and denominator to a total of 27 bits, where we 
need only store 26 of these bits (since the leading bit of the denominator is 1 unless 
the denominator has length 0). This leaves room for a sign and five bits to indicate 
the denominator size. Another idea is to use 28 bits for numerator and denominator, 
which are to have a total of at most seven hexadecimal digits, together with a sign and 
a 3-bit field to indicate the number of hexadecimal digits in the denominator. 

[Using the formulas in the next exercise, the first alternative leads to exactly 
2140040119 finite representable numbers, while the second leads to 1830986459. The 
first alternative is preferable because it represents more values, and because it is cleaner 
and makes smoother transitions between ranges. With 64-bit words we would, similarly,
limit numerator and denominator to a total of at most 64 — 6 = 58 bits.] 


14. The number of multiples of $n$ in the interval $(a\,..\,b]$ is $\lfloor b/n\rfloor - \lfloor a/n\rfloor$. Hence, by
inclusion and exclusion, the answer to this problem is $S_0 - S_1 + S_2 - \cdots$, where $S_k$ is
$\sum(\lfloor M_2/P\rfloor - \lfloor M_1/P\rfloor)(\lfloor N_2/P\rfloor - \lfloor N_1/P\rfloor)$, summed over all products $P$ of $k$ distinct
primes. We can also express the answer as

$\sum_{n=1}^{\min(M_2, N_2)} \mu(n)\,(\lfloor M_2/n\rfloor - \lfloor M_1/n\rfloor)(\lfloor N_2/n\rfloor - \lfloor N_1/n\rfloor)$.
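
The Möbius form of the sum is easy to check against brute force. The sketch below assumes the quantity being counted is the number of pairs $(u, u')$ with $M_1 < u \le M_2$, $N_1 < u' \le N_2$, and $\gcd(u, u') = 1$; the sieve and the function names are illustrative choices.

```python
from math import gcd


def mobius_upto(n):
    """Return a list mu with mu[k] = Möbius function of k, for 0 <= k <= n."""
    mu = [1] * (n + 1)
    is_prime = [True] * (n + 1)
    for p in range(2, n + 1):
        if is_prime[p]:
            for m in range(p, n + 1, p):
                if m > p:
                    is_prime[m] = False
                mu[m] = -mu[m]            # one more distinct prime factor
            for m in range(p * p, n + 1, p * p):
                mu[m] = 0                 # divisible by a square
    return mu


def coprime_pairs(M1, M2, N1, N2):
    """Count pairs in (M1..M2] x (N1..N2] that are relatively prime."""
    top = min(M2, N2)
    mu = mobius_upto(top)
    return sum(mu[n] * (M2 // n - M1 // n) * (N2 // n - N1 // n)
               for n in range(1, top + 1))


if __name__ == "__main__":
    print(coprime_pairs(3, 20, 5, 17))
```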


SECTION 4.5.2 


1. Substitute min, max, + consistently for gcd, lcm, ×, respectively (after making
sure that the identities are correct when any variable is zero).


2. For prime $p$, let $u_p$, $v_{1p}$, ..., $v_{np}$ be the exponents of $p$ in the canonical factorizations
of $u$, $v_1$, ..., $v_n$. By hypothesis, $u_p \le v_{1p} + \cdots + v_{np}$. We must show that
$u_p \le \min(u_p, v_{1p}) + \cdots + \min(u_p, v_{np})$, and this is certainly true if $u_p$ is greater than
or equal to each $v_{jp}$, or if $u_p$ is less than some $v_{jp}$.

3. Solution 1: If $n = p_1^{e_1} \ldots p_r^{e_r}$, the number in each case is $(2e_1 + 1) \ldots (2e_r + 1)$.
Solution 2: A one-to-one correspondence is obtained if we set $u = \gcd(d, n)$ and $v =
n^2/\mathrm{lcm}(d, n)$ for each divisor $d$ of $n^2$. [E. Cesàro, Annali di Matematica Pura ed Applicata
(2) 13 (1885), 235-250, §12.]

4. See exercise 3.2.1.2-15(a). 


5. Shift $u$ and $v$ right until neither is a multiple of 3, remembering the proper power
of 3 that will appear in the gcd. Each subsequent iteration sets $t \leftarrow u + v$ or $t \leftarrow u - v$
(whichever is a multiple of 3), shifts $t$ right until it is not a multiple of 3, then replaces
$\max(u, v)$ by the result.


u v t 


13634 24140 10506, 3502; 
13634 3502 17136, 5712, 1904; 
1904 3502 5406, 1802; 
1904 1802 102, 34; 
34 1802 1836, 612, 204, 68; 
34 68 102, 34; 
34 34 0. 


The evidence that gcd(40902, 24140) = 34 is now overwhelming. 
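
A direct Python rendering of this ternary analog of the binary gcd algorithm follows. It is a sketch under assumptions: "shift right" is taken to mean division by 3, inputs are assumed positive, and the loop stops as soon as $u = v$ instead of continuing to $t = 0$ as the table does. The function name is hypothetical.

```python
def gcd_base3(u, v):
    """gcd of positive integers using only add, subtract, and division by 3."""
    power_of_3 = 1
    while u % 3 == 0 and v % 3 == 0:       # common factors of 3 belong to the gcd
        u //= 3
        v //= 3
        power_of_3 *= 3
    while u % 3 == 0:                      # remaining factors of 3 are irrelevant
        u //= 3
    while v % 3 == 0:
        v //= 3
    while u != v:
        # Exactly one of u + v and u - v is a multiple of 3 here.
        t = u + v if (u + v) % 3 == 0 else abs(u - v)
        while t % 3 == 0:
            t //= 3
        if u > v:                          # replace max(u, v) by t
            u = t
        else:
            v = t
    return power_of_3 * u


if __name__ == "__main__":
    print(gcd_base3(40902, 24140))
```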
6. The probability that both $u$ and $v$ are even is $\frac14$; the probability that both are
multiples of four is $\frac{1}{16}$; etc. Thus $A$ has the distribution given by the generating function

$\frac34 + \frac{3}{16}z + \frac{3}{64}z^2 + \cdots = \frac{3/4}{1 - z/4}$.

The mean is $\frac13$, and the standard deviation is $\sqrt{\frac29 + \frac13 - \frac19} = \frac23$. If $u$ and $v$ are
independently and uniformly distributed with $1 \le u, v < 2^N$, some small correction
terms are needed; the mean is then actually

$(2^N - 1)^{-2} \sum_{k=1}^{N} (2^{N-k} - 1)^2 = \frac13 - \frac43(2^N - 1)^{-1} + N(2^N - 1)^{-2}$.
7. When $u$ and $v$ are not both even, each of the cases (even, odd), (odd, even), (odd,
odd) is equally probable, and $B = 1$, 0, 0 in these cases. Hence $B = \frac13$ on the average.
Actually, as in exercise 6, a small correction should be given to be strictly accurate
when $1 \le u, v < 2^N$; the probability that $B = 1$ is actually

$(2^N - 1)^{-2} \sum_{k=1}^{N} (2^{N-k} - 1)\,2^{N-k} = \frac13 - \frac13(2^N - 1)^{-1}$.


8. Let $F$ be the number of subtraction steps in which $u > v$; then $E = F + B$. If
we change the inputs from $(u, v)$ to $(v, u)$, the value of $C$ stays unchanged, while $F$
becomes $C - 1 - F$. Hence $E_{\mathrm{ave}} = \frac12(C_{\mathrm{ave}} - 1) + B_{\mathrm{ave}}$.


9. The binary algorithm first gets to B6 with u = 1963, v = 1359; then t + 604, 302, 
151, etc. The gcd is 302. Using Algorithm X we find that 2 - 31408 — 23 - 2718 = 302. 


10. (a) Two integers are relatively prime if and only if they are not both divisible by
any prime number. (b) Rearrange the sum in (a), with denominators $k = p_1 \ldots p_r$.
(Each of the sums in (a) and (b) is actually finite.) (c) Since $(n/k)^2 - \lfloor n/k\rfloor^2 =
O(n/k)$, we have $q_n - \sum_{k=1}^{n} \mu(k)(n/k)^2 = \sum_{k=1}^{n} O(n/k) = O(nH_n)$. Furthermore
$\sum_{k>n}(n/k)^2 = O(n)$. (d) $\sum_{d \backslash n} \mu(d) = \delta_{1n}$. [In fact, we have the more general result

$\sum_{d \backslash n} \mu(d)\,(n/d)^s = n^s - \sum (n/p)^s + \sum (n/pq)^s - \cdots$,

as in part (b), where the sums on the right are over the prime divisors of $n$, and this
is equal to $n^s(1 - 1/p_1^s) \ldots (1 - 1/p_r^s)$ if $n = p_1^{e_1} \ldots p_r^{e_r}$.]

Notes: Similarly, we find that a set of $k$ integers is relatively prime with probability
$1/\zeta(k) = 1/(\sum_{n \ge 1} 1/n^k)$. This proof of Theorem D is due to F. Mertens, Crelle 77
(1874), 289-291. The technique actually gives a much stronger result, namely that
$6\pi^{-2}mn + O(n \log m)$ pairs of integers $u \in [f(m)\,..\,f(m) + m)$, $v \in [g(n)\,..\,g(n) + n)$
are relatively prime, when $m \le n$, $f(m) = O(m)$, and $g(n) = O(n)$.

11. (a) $6/\pi^2$ times $1 + \frac14 + \frac19$, namely $49/(6\pi^2) \approx .82746$. (b) $6/\pi^2$ times $1/1 + 2/4 +
3/9 + \cdots$, namely $\infty$. (This is true in spite of the results of exercises 12 and 14.)

12. [Annali di Mat. (2) 13 (1885), 235-250, §3.] Let $\sigma(n)$ be the number of positive
divisors of $n$. The answer is

$\sum_{k \ge 1} \sigma(k) \cdot \frac{6}{\pi^2 k^2} = \frac{6}{\pi^2}\Bigl(\sum_{k \ge 1} \frac{1}{k^2}\Bigr)^2 = \frac{\pi^2}{6}$.

[Thus, the average is less than 2, although there are always at least two common
divisors when $u$ and $v$ are not relatively prime.]


13. $1 + \frac19 + \frac{1}{25} + \cdots = 1 + \frac14 + \frac19 + \cdots - \frac14(1 + \frac14 + \frac19 + \cdots)$.


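14. (a) $L = (6/\pi^2) \sum_{d \ge 1} d^{-2} \ln d = -\zeta'(2)/\zeta(2) = \sum_{p\ \mathrm{prime}} (\ln p)/(p^2 - 1) \approx 0.56996$.
(b) $(8/\pi^2) \sum_{d \ge 1} [d\ \mathrm{odd}]\, d^{-2} \ln d = L - \frac13 \ln 2 \approx 0.33891$.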


15. $v_1 = \pm v/u_3$, $v_2 = \mp u/u_3$ (the sign depends on whether the number of iterations
is even or odd). This follows from the fact that $v_1$ and $v_2$ are relatively prime to each
other (throughout the algorithm), and that $v_1 u = -v_2 v$. [Hence $v_1 u = \mathrm{lcm}(u, v)$ at the
close of the algorithm, but this is not an especially efficient way to compute the least
common multiple. For a generalization, see exercise 4.6.1-18.]

Further details can be found in exercise 4.5.3-48. 
16. Apply Algorithm X to $v$ and $m$, thus obtaining a value $x$ such that $xv \equiv 1$
(modulo $m$). (This can be done by simplifying Algorithm X so that $u_2$, $v_2$, and $t_2$ are
not computed, since they are never used in the answer.) Then set $w \leftarrow ux \bmod m$. [It
follows, as in exercise 4.5.3-45, that this process requires $O(n^2)$ units of time, when it is
applied to large $n$-bit numbers. See exercises 17 and 39 for alternatives to Algorithm X.]
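
A simplified extended Euclidean loop in the spirit of this answer can be sketched in Python. It is not Algorithm X verbatim: only the coefficient of $v$ is tracked, the variable names are chosen here for illustration, and an exception stands in for the case where $v$ has no inverse.

```python
def mod_divide(u, v, m):
    """Return w with w*v ≡ u (modulo m), assuming gcd(v, m) = 1."""
    # Invariant: r0 ≡ x0*v and r1 ≡ x1*v (modulo m).
    x0, x1 = 1, 0
    r0, r1 = v % m, m
    while r1 != 0:
        q = r0 // r1
        x0, x1 = x1, x0 - q * x1
        r0, r1 = r1, r0 - q * r1
    if r0 != 1:
        raise ValueError("v is not relatively prime to m")
    return u * x0 % m


if __name__ == "__main__":
    print(mod_divide(3, 7, 26))
```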


17. We can let $u' = (2u - vu^2) \bmod 2^{2e}$, as in Newton’s method (see the end of Section
4.3.1). Equivalently, if $uv \equiv 1 + 2^e w$ (modulo $2^{2e}$), let $u' = u + 2^e((-uw) \bmod 2^e)$.
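
The doubling step can be iterated to invert an odd number modulo any power of 2. In the sketch below the starting value $u = 1$ (valid modulo $2^1$ for every odd $v$) and the function name are assumptions made for illustration; each pass applies $u' = (2u - vu^2) \bmod 2^{2e}$.

```python
def inverse_mod_power_of_two(v, bits):
    """Return u with u*v ≡ 1 (modulo 2**bits), for odd v."""
    if v % 2 == 0:
        raise ValueError("v must be odd")
    u, e = 1, 1                      # u*v ≡ 1 (modulo 2^e) holds for e = 1
    while e < bits:
        e *= 2                       # precision doubles at every step
        u = (2 * u - v * u * u) % (1 << e)
    return u % (1 << bits)


if __name__ == "__main__":
    u = inverse_mod_power_of_two(12345, 64)
    print(hex(u), (u * 12345) % 2 ** 64)
```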
18. Let $u_1$, $u_2$, $u_3$, $v_1$, $v_2$, $v_3$ be multiprecision variables, in addition to $u$ and $v$. The
extended algorithm will act the same on $u_3$ and $v_3$ as Algorithm L does on $u$ and $v$. New
multiprecision operations are to set $t \leftarrow Au_j$, $t \leftarrow t + Bv_j$, $w \leftarrow Cu_j$, $w \leftarrow w + Dv_j$,
$u_j \leftarrow t$, $v_j \leftarrow w$ for all $j$, in step L4; also if $B = 0$ in that step to set $t \leftarrow u_j - qv_j$,
$u_j \leftarrow v_j$, $v_j \leftarrow t$ for all $j$ and for $q = \lfloor u_3/v_3\rfloor$. A similar modification is made to step L1
if $v_3$ is small. The inner loop (steps L2 and L3) is unchanged.

19. (a) Set $t_1 = x + 2y + 3z$; then $3t_1 + y + 2z = 1$, $5t_1 - 3y - 20z = 3$. Eliminate $y$, then
$14t_1 - 14z = 6$: No solution. (b) This time $14t_1 - 14z = 0$. Divide by 14, eliminate $t_1$;
the general solution is $x = 8z - 2$, $y = 1 - 5z$, $z$ arbitrary.


20. We can assume that m > n. If m > n = 0 we get to (m — t,0) with probability 
27* for 1 < t < m, to (0,0) with probability 2'~™. Valida vi, the following values can 
be obtained for n > 0: 

Case 1,m = n. From (n,n) we go to (n—t, n) with probability t/2*—5/2**1+3/2”, 
for 2 < t < n. (These values are b a. a. ....) To (0,n) the probability is 
n/2"-! — 1/2"? + 1/2?" To (n,k) the probability is the same as to (k,n). The 
algorithm terminates with probability 1/2”~’. 

Case 2, m =n-+1. From (n+1,n) we get to (n,n) with probability $ when n > 1, 
or 0 when n = 1; to (n — t, n) with probability 11/2'+? — 3/271, for 1 <t<n-—-1. 


(These values are 4, $, 4%, ----) We get to (1,m) with probability 5/att —3/227-1, 
for n > 1; to (0,n) with probability 3/2" — 1/2?"71. 
Case 8,m >n-+ 2. The probabilities are given by the following table: 
m—1,n): 1/2 —3/2-"*? — 691/21: 
m-t, n): 1/2 43/2070, 1<t<n; 
m-n, n): 1/2” + 1/2”, n> l; 
(m—n-—t,n): 1/2°** + 6/2", l<t<m-n; 
0,n): 1/271. 


The only thing interesting about these results is that they are so messy; but that 
makes them uninteresting. 


21. Show that for fixed v and for $2^m < u < 2^{m+1}$, when m is large, each subtract-and-shift cycle of the algorithm reduces $\lfloor\lg u\rfloor$ by two, on the average.


22. Exactly $(N-m)2^{m-1+\delta_{m0}}$ integers u in the range $1 \le u < 2^N$ have $\lfloor\lg u\rfloor = m$, after u has been shifted right until it is odd. Thus

$$(2^N-1)^2\,C = N^2C_{00} + 2N\sum_{1\le n\le N}(N-n)2^{n-1}C_{n0} + 2\sum_{1\le n<m\le N}(N-m)(N-n)2^{m+n-2}C_{mn} + \sum_{1\le n\le N}(N-n)^2\,2^{2n-2}C_{nn}.$$

(The same formula holds for D in terms of $D_{mn}$.)
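The middle sum is $2^{2N-2}\sum_{0\le m<n<N} mn\,2^{-m-n}\bigl((\alpha+\beta)N + \gamma - \alpha m - \beta n\bigr)$. Since

$$\sum_{0\le m<n} m2^{-m} = 2 - (n+1)2^{1-n} \quad\text{and}\quad \sum_{0\le m<n} m(m-1)2^{-m} = 4 - (n^2+n+2)2^{1-n},$$

the sum on m is

$$2^{2N-2}\sum_{0\le n<N} n2^{-n}\Bigl(\bigl(\gamma-\alpha-\beta n+(\alpha+\beta)N\bigr)\bigl(2-(n+1)2^{1-n}\bigr) - \alpha\bigl(4-(n^2+n+2)2^{1-n}\bigr)\Bigr)$$
$$= 2^{2N-2}\Bigl((\alpha+\beta)N\sum_{n\ge0} n2^{-n}\bigl(2-(n+1)2^{1-n}\bigr) + O(1)\Bigr).$$

Thus the coefficient of $(\alpha+\beta)N$ in the answer is found to be $2^{-2}\bigl(4 - (\tfrac43)^3\bigr) = \tfrac{11}{27}$. A similar argument applies to the other sums.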

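Note: The exact value of the sums may be obtained after some tedious calculation by means of the general summation-by-parts formula

$$\sum_{0\le k<n} k^{\underline{m}} z^k = \frac{m!\,z^m}{(1-z)^{m+1}} - \sum_{k=0}^{m} \frac{m^{\underline{k}}\, n^{\underline{m-k}}}{(1-z)^{k+1}}\, z^{n+k}.$$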
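The summation formula in the note, and the special case $\sum_{0\le m<n} m2^{-m} = 2-(n+1)2^{1-n}$ used above, can be checked with exact rational arithmetic. This Python sketch is illustrative only; the helper names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def falling(x, k):
    """Falling factorial power x(x-1)...(x-k+1); the empty product is 1."""
    result = 1
    for i in range(k):
        result *= x - i
    return result


def direct_sum(m, n, z):
    """Left side: sum of k^(falling m) * z^k over 0 <= k < n."""
    return sum(falling(k, m) * z**k for k in range(n))


def closed_form(m, n, z):
    """Right side of the summation-by-parts formula (requires z != 1)."""
    head = factorial(m) * z**m / (1 - z) ** (m + 1)
    tail = sum(
        falling(m, k) * falling(n, m - k) * z ** (n + k) / (1 - z) ** (k + 1)
        for k in range(m + 1)
    )
    return head - tail
```

With `z = Fraction(1, 2)` and m = 1 or 2 this reproduces the two auxiliary sums of answer 22.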


23. If $x \le 1$ it is $\Pr(u \ge v \text{ and } v/u \le x) = \tfrac12(1 - G_n(x))$. And if $x \ge 1$ it is $\tfrac12 + \Pr(u \le v \text{ and } v/u \ge 1/x) = \tfrac12 + \tfrac12 G_n(1/x)$; this also equals $\tfrac12(1 - G_n(x))$ by (40).

24. 3>,.,2°-"G(1/(2* + 1)) = S(1). This value, which has no obvious connection to 
classical constants, is approximately 0.5432582959. 


25. Richard Brent has noted that G(e~”) is an odd function that is analytic for all 
real values of y. If we let G(e~”) = Ary + Asy? + Asy? +- = p(e~¥ — 1), we have 
pi =A = À, p2 = $À, ps = 5A H às, pa = ZA 3A3, ps = 5A H TAs H As} 


A n] k! n) k! 
D'o =|, Ew EDD ME 


The first few values are 41 œ~ .3979226812, A3 ~ —.0210096400, As œ~ .0013749841, 
Az = —.0000960351. Wild conjecture: limp00(—A2Kn+1/A2K-1) = 1/n?. 
26. The left side is 25(1/x)—55(1/2x)+25(1/4%)—25(x)+55\(2x)—25(42) by (39); the 
right side is S(2x%) —2S(4x)+25(1/x)—S(1/2x) —2S(«x) +45 (2x) —45(1/2x)+25(1/42) 
by (44). The cases z = 1, x = 1/2, and x = ¢ are perhaps the most interesting; for 
example, x = ¢ gives 2G(4@) — 5G(2¢) + G(¢?/2) — G(¢?) = 2G (24°). 


27. 2n = [2"] z» ok>0 paes DF a Dive iz/2" y = Š k>12 TENAN Ea a Oo = 
esl ghee) burs (7) Bakr /n by exercise 1.2.11.2-4, when n > 1; and of course 
ya oO = 1/24 —1). 

28. Letting Sn(m) = Pza (1 — k/m)” and Tn(m) = 1/(e"/™ — 1) as in exercise 
6.3-34(b), we find S;,(m) = Tn(m) + O(e7"/ n/m?) and 2n41 = 0351 2777 Sn(2’) = 
™m+O(n-%), where Tn = X> 2-79 T,, (27). Since Tn41 < Tn and 4Tən — Tn = 1/(e"—1) 


is positive but exponentially small, it follows that mn = O(n’). More detailed 
information can be obtained by writing 


D 1 oe I(z)n-* HE 1 [a (z) (z)n7? j 
= 273 Te ~ Oni 3 Qi2-) 2ri Js 22-2] ` 


/2—ico /[2—ico 
The integral is the sum of the residues at the poles 2 + 27ik/In2, namely n~? times 
n*/(6In 2) + f(n), where 

f(n) = 25° R(¢(2 + 2rik/ln 2) r (2 + 2wik/ In 2) exp(—2ik lg n)/ln 2) 

k>1 

is a periodic function of lg n whose “average” value is zero. 
29. (Solution by P. Flajolet and B. Vallée.) If f(x) = 0,5, 27"g(2"x) and g*(s) = 
Jy g(a)a** da, then f*(s) = Eps, 2 °C g*(s) = g*(s)/(2°** — 1), and f(z) = 
s Mies f*(s)a *ds under appropriate conditions. Letting g(x) = 1/(1 + x), we find 
that the transform in this case is g*(s) = m/sin ms when 0 < Rs < 1; hence 


S11 1 fer nads 
[e= Type =a 2°41 J) sin rs” 
~ x Ti Si j2-ico ( —1)sinzs 
It follows that f(x) is the sum of the residues of -*—x~*/(2°t'—1) for Rs < 0, namely 
l+algx+5x+2xP(lgx) — 207 + $r’ — $2*+---, where 
Ir S sin 2mmt 
P(t)= 
(#) In2 2 sinh(2m7?/In 2) 


is a periodic function whose absolute value never exceeds 8 x 1071”. (The fact that 
P(t) is so small caused Brent to overlook it in his original paper.) 

The Mellin transform of f(1/x) is f*(—s) = 7/((1—21~*) sin rs) for —1 < Rs < 0; 
thus f(1/z) = 55 (= T *ds/(1 — 2'~*), and we now want the residues of 


1/2—iœo sinas 


the integrand with Rs < —1: f(1/x) = $a — ir? +---. [This formula could also have 


been obtained directly.] We have Si(x) = 1 — f(x), and it follows that 


2 


£ 
1+ 


Gi(z) = f(a) — #012) = sigs + 52 + 2Pliga) - -7 + (1-2) (a), 


where $(x) = D &o(—1)"x"/(25+1 — 1). 
30. We have G(x) = Xi (£) — X1 (1/£) + X2(x) — X2(1/x), where 


1 1 1 1 
dn (a2) = 5 RFI i py? X(x) = >» k TL Oke," 
Ta? 1+ 2'(1 + 2x) TE 1+2 + 2ka 
The Mellin transforms are Xł (s) = ="—a(s)/(2*t'—1), 53(s) = == W(s)/(2°t'—-1), 


sin Ts sin Ts 
where 


DHR -D(a 


1>1 k>0 
l s—1 s—1 1 
9) = ety = D E ae 
1>1 k>0 


Therefore we obtain the following expansions for 0 < x < 1: 


X(x) = a(0) + a(—1)a(lg e+) — a'(1)£/ln 2 + wA(Ig x) er k)(—2)*, 


k>2 


S2(2) = 6(0) + (-1)2(lga+4) — 0 (1)s/m2 + eBge) -S zer lk) a), 


k>2 
Qh 
k>1 
(—a)* 1 1 Hri, 
52(1/z) 5 Qkt+1 _ J Iga b(k) 2 2k+1_] h2 | P(g zx) , 
k>1 
feet 1 
We) = OC. Jac 
k=0 
= 1 2ri f ° —2mrit 
A(t) = FE L E a(—1+ 2mmi/In 2) e i: 
Bit) = — 5 R A b(—1 + 2mri/ln2) ecm 
~ In2 “> \sinh(2mr?/In 2) ; 


> 
S 
l 
Ej- 
Mı 
8 


mt (* -1- es oe 
sinh(2m7/In 2) k-1 f 


32. Yes: See G. Maze, J. Discrete Algorithms 5 (2007), 176-186. 


34. Brigitte Vallée [Algorithmica 22 (1998), 660-685] has found an elegant and rig- 
orous analysis of Algorithm B, using an approach quite different from that of Brent. 
Indeed, her methods are sufficiently different that they are not yet known to predict the 
same behavior as Brent’s heuristic model. Thus the problem of analyzing the binary 
gcd algorithm, now solved rigorously for the first time, continues to lead to ever more 
tantalizing questions of higher mathematics. 

35. By induction, the length is $m + \lfloor n/2\rfloor + 1 - [m = n = 1]$ when $m \ge n$. But exercise 37 shows that the algorithm cannot go as slowly as this.

36. Let $a_n = (2^n - (-1)^n)/3$; then $a_0, a_1, a_2, \ldots = 0, 1, 1, 3, 5, 11, 21, \ldots$. (This sequence of numbers has an interesting pattern of zeros and ones in its binary representation. Notice that $a_n = a_{n-1} + 2a_{n-2}$, and $a_n + a_{n+1} = 2^n$.) For m > n, let $u = 2^{m+1} - a_{n+2}$, $v = a_{n+2}$. For m = n > 0, let $u = a_{n+2}$ and $v = u + (-1)^n$. Another example for the case m = n > 0 is $u = 2^{n+1} - 2$, $v = 2^{n+1} - 1$; this choice takes more shifts, and gives B = 1, C = n + 1, D = 2n, E = n, the worst case for Program B.


37. (Solution by J. O. Shallit.) This is a problem where it appears to be necessary to prove more than was asked just to prove what was asked. Let S(u, v) be the number of subtraction steps taken by Algorithm B on inputs u and v. We will prove that $S(u,v) \le \lg(u+v)$. This will imply that $S(u,v) \le \lfloor\lg(u+v)\rfloor \le \lfloor\lg 2\max(u,v)\rfloor = 1 + \lfloor\lg\max(u,v)\rfloor$ as desired.

Notice that S(u, v) = S(v, u). If u is even, S(u, v) = S(u/2, v); hence we may assume that u and v are odd. We may also assume that u > v, since S(u, u) = 1. Then $S(u,v) = 1 + S((u-v)/2, v) \le 1 + \lg((u-v)/2 + v) = \lg(u+v)$ by induction.

It follows, incidentally, that the smallest case requiring n subtraction steps is $u = 2^{n-1} + 1$, $v = 2^{n-1} - 1$.
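Shallit's bound is easy to test experimentally. The sketch below counts subtraction steps in the way the proof does (so that S(u, u) = 1); the function name is hypothetical, and the code is an illustration of the counting rule rather than a transcription of Algorithm B.

```python
def subtraction_steps(u, v):
    """S(u, v): subtraction steps of the binary gcd method on positive u, v."""
    # Powers of 2 never cost a subtraction, since S(u, v) = S(u/2, v) for even u.
    while u % 2 == 0:
        u //= 2
    while v % 2 == 0:
        v //= 2
    steps = 0
    while True:
        steps += 1                 # one subtraction t <- u - v
        if u == v:
            return steps           # t = 0, the algorithm stops
        if u < v:
            u, v = v, u
        u -= v                     # u - v is even and nonzero here
        while u % 2 == 0:
            u //= 2
```

The inequality $S(u,v) \le \lg(u+v)$ is the same as $2^{S(u,v)} \le u+v$, which avoids floating-point logarithms in a check.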

38. Keep track of the most significant and least significant words of the operands (the most significant is used to guess the sign of t and the least significant is to determine the amount of right shift), while building a 2 × 2 matrix A of single-precision integers such that $A\binom{u}{v} = \binom{u'w}{v'w}$, where w is the computer word size and where u' and v' are smaller than u and v. (Instead of dividing the simulated even operand by 2, multiply the other one by 2, until obtaining multiples of w after exactly lg w shifts.) Experiments show
this algorithm running four times as fast as Algorithm L, on at least one computer. 
With the similar algorithm of exercise 40 we don’t need the most significant words. 

A possibly faster binary algorithm has been described by J. Sorenson, J. Algo- 
rithms 16 (1994), 110-144; Shallit and Sorenson, Lecture Notes in Comp. Sci. 877 
(1994), 169-183. 


39. (Solution by Michael Penk.) Assume that u and v are positive. 


Y1. [Find power of 2.] Same as step B1.

Y2. [Initialize.] Set $(u_1, u_2, u_3) \leftarrow (1, 0, u)$ and $(v_1, v_2, v_3) \leftarrow (v, 1-u, v)$. If u is odd, set $(t_1, t_2, t_3) \leftarrow (0, -1, -v)$ and go to Y4. Otherwise set $(t_1, t_2, t_3) \leftarrow (1, 0, u)$.

Y3. [Halve $t_3$.] If $t_1$ and $t_2$ are both even, set $(t_1, t_2, t_3) \leftarrow (t_1, t_2, t_3)/2$; otherwise set $(t_1, t_2, t_3) \leftarrow (t_1 + v, t_2 - u, t_3)/2$. (In the latter case, $t_1 + v$ and $t_2 - u$ will both be even.)

Y4. [Is $t_3$ even?] If $t_3$ is even, go back to Y3.

Y5. [Reset $\max(u_3, v_3)$.] If $t_3$ is positive, set $(u_1, u_2, u_3) \leftarrow (t_1, t_2, t_3)$; otherwise set $(v_1, v_2, v_3) \leftarrow (v - t_1, -u - t_2, -t_3)$.

Y6. [Subtract.] Set $(t_1, t_2, t_3) \leftarrow (u_1, u_2, u_3) - (v_1, v_2, v_3)$. Then if $t_1 \le 0$, set $(t_1, t_2) \leftarrow (t_1 + v, t_2 - u)$. If $t_3 \ne 0$, go back to Y3. Otherwise the algorithm terminates with $(u_1, u_2, u_3 \cdot 2^k)$ as the output. ∎


It is clear that the relations in (16) are preserved, and that $0 \le u_1, v_1, t_1 \le v$, $0 \ge u_2, v_2, t_2 \ge -u$, $0 < u_3 \le u$, $0 < v_3 \le v$ after each of steps Y2-Y6. If u is odd after step Y1, then step Y3 can be simplified, since $t_1$ and $t_2$ are both even if and only if $t_2$ is even; similarly, if v is odd, then $t_1$ and $t_2$ are both even if and only if $t_1$ is even. Thus, as in Algorithm X, it is possible to suppress all calculations involving $u_2$, $v_2$, and $t_2$, provided that v is odd after step Y1. This condition is often known in advance (for example, it holds when v is prime and we are trying to compute $u^{-1}$ modulo v).
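Steps Y1-Y6 translate almost line for line into a program. The following Python sketch is one such transcription, under the assumption that the test in step Y6 reads "$t_1 \le 0$"; the function names are hypothetical, and the returned triple (a, b, g) is meant to satisfy ua + vb = g = gcd(u, v) for the original inputs.

```python
from math import gcd


def penk_extended_gcd(u, v):
    """Binary extended gcd for positive u, v; returns (a, b, g) with u*a + v*b == g."""
    k = 0
    while u % 2 == 0 and v % 2 == 0:            # Y1: remove common powers of 2
        u //= 2
        v //= 2
        k += 1
    u1, u2, u3 = 1, 0, u                         # Y2
    v1, v2, v3 = v, 1 - u, v
    if u % 2 == 1:
        t1, t2, t3 = 0, -1, -v
    else:
        t1, t2, t3 = 1, 0, u
    while True:
        while t3 % 2 == 0:                       # Y3 and Y4: halve while t3 is even
            if t1 % 2 == 0 and t2 % 2 == 0:
                t1, t2, t3 = t1 // 2, t2 // 2, t3 // 2
            else:
                t1, t2, t3 = (t1 + v) // 2, (t2 - u) // 2, t3 // 2
        if t3 > 0:                               # Y5
            u1, u2, u3 = t1, t2, t3
        else:
            v1, v2, v3 = v - t1, -u - t2, -t3
        t1, t2, t3 = u1 - v1, u2 - v2, u3 - v3   # Y6
        if t1 <= 0:
            t1, t2 = t1 + v, t2 - u
        if t3 == 0:
            return u1, u2, u3 << k


def bezout_holds(u, v):
    """Check the output relation against math.gcd."""
    a, b, g = penk_extended_gcd(u, v)
    return g == gcd(u, v) and u * a + v * b == g
```

For example, `penk_extended_gcd(3, 5)` works through the steps to the triple (2, -1, 1), since 3·2 − 5 = 1.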

See also A. W. Bojanczyk and R. P. Brent, Computers and Math. 14 (1987), 233, 
for a similar extension of the algorithm in exercise 40. 


40. Let $m = \lg\max(|u|, |v|)$. We can show inductively that $|u| \le 2^{m-(s-c)/2}$, $|v| \le 2^{m-(s+c)/2}$ after we have performed the operation $c \leftarrow c + 1$ in step K3 s times. Therefore $s \le 2m$. If K2 is executed t times, we have $t \le s + 2$, because s increases every time except the first and last. [See VLSI '83 (North-Holland, 1983), 145-154.]

Notes: When u = 1 and $v = 3\cdot2^k - 1$ and k > 2, we have m = k + 2, s = 2k, t = k + 4. When $u = u_j$ and $v = 2u_{j-1}$ in the sequence defined by $u_0 = 3$, $u_1 = 1$, $u_{j+1} = \min(|3u_j - 16u_{j-1}|, |5u_j - 16u_{j-1}|)$, we have s = 2j + 2, t = 2j + 3, and (empirically) $m \approx \phi j$. Can t be asymptotically larger than $2m/\phi$?

41. In general, since $(a^m - 1) \bmod (a^n - 1) = a^{m \bmod n} - 1$ (see Eq. 4.3.2-(20)), we find that $\gcd(a^m - 1, a^n - 1) = a^{\gcd(m,n)} - 1$ for all positive integers a.
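Both statements of answer 41 can be confirmed numerically for small parameters. The Python sketch below is only a spot check with hypothetical function names.

```python
from math import gcd


def remainder_rule_holds(a, m, n):
    """(a^m - 1) mod (a^n - 1) == a^(m mod n) - 1, for a >= 2 and m, n >= 1."""
    return (a**m - 1) % (a**n - 1) == a ** (m % n) - 1


def gcd_rule_holds(a, m, n):
    """gcd(a^m - 1, a^n - 1) == a^gcd(m, n) - 1."""
    return gcd(a**m - 1, a**n - 1) == a ** gcd(m, n) - 1
```

The remainder rule makes Euclid's algorithm on the pair $(a^m-1, a^n-1)$ mimic Euclid's algorithm on the exponents (m, n), which is why the gcd rule follows.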
42. Subtract the kth column from the 2kth, 3kth, 4kth, etc., for k = 1, 2, 3, .... The result is a triangular matrix with $x_k$ on the diagonal in column k, where $m = \sum_{d\backslash m} x_d$. It follows that $x_m = \varphi(m)$, so the determinant is $\varphi(1)\varphi(2)\ldots\varphi(n)$.

[In general, "Smith's determinant," in which the (i, j) element is f(gcd(i, j)) for an arbitrary function f, is equal to $\prod_{m=1}^{n}\sum_{d\backslash m}\mu(m/d)f(d)$, by the same argument. See L. E. Dickson, History of the Theory of Numbers 1 (Carnegie Inst. of Washington, 1919), 122-123.]
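The column operations of answer 42 can be carried out mechanically on the matrix of gcd(i, j). In the Python sketch below (hypothetical names; the totient is computed by brute force purely for illustration), the reduced matrix should have $\varphi(j)$ in row i of column j when j divides i, and 0 elsewhere.

```python
from math import gcd


def totient(m):
    """Euler's totient by direct count; adequate for small m."""
    return sum(1 for r in range(1, m + 1) if gcd(r, m) == 1)


def reduce_gcd_matrix(n):
    """Apply 'subtract column k from columns 2k, 3k, ...' for k = 1, 2, ..., n."""
    a = [[gcd(i, j) for j in range(1, n + 1)] for i in range(1, n + 1)]
    for k in range(1, n + 1):
        for j in range(2 * k, n + 1, k):
            for i in range(n):
                a[i][j - 1] -= a[i][k - 1]
    return a
```

Since column operations of this kind do not change the determinant, the triangular form gives the product $\varphi(1)\varphi(2)\ldots\varphi(n)$ at once.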
SECTION 4.5.3 
1. The running time is about 19.02T + 6, just a trifle slower than Program 4.5.2A. 
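2. $\begin{pmatrix} K_n(x_1, x_2, \ldots, x_{n-1}, x_n) & K_{n-1}(x_1, x_2, \ldots, x_{n-1}) \\ K_{n-1}(x_2, \ldots, x_{n-1}, x_n) & K_{n-2}(x_2, \ldots, x_{n-1}) \end{pmatrix}$.
3. $K_n(x_1, \ldots, x_n)$.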
4. By induction, or by taking the determinant of the matrix product in exercise 2. 
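The matrix identity of answer 2 and the determinant argument of answer 4 can be illustrated together. This Python sketch assumes the usual recurrence $K_n = x_nK_{n-1} + K_{n-2}$ with $K_0 = 1$, $K_{-1} = 0$; the function names are hypothetical.

```python
def continuant(xs):
    """K_n(x_1, ..., x_n) by the recurrence K_n = x_n*K_{n-1} + K_{n-2}."""
    prev, cur = 0, 1               # K_{-1}, K_0
    for x in xs:
        prev, cur = cur, x * cur + prev
    return cur


def matrix_product(xs):
    """Product of the 2x2 matrices (x 1; 1 0) for x = x_1, ..., x_n."""
    m = [[1, 0], [0, 1]]
    for x in xs:
        m = [[m[0][0] * x + m[0][1], m[0][0]],
             [m[1][0] * x + m[1][1], m[1][0]]]
    return m


def continuant_matrix(xs):
    """The matrix of continuants from answer 2 (valid for len(xs) >= 2)."""
    return [[continuant(xs), continuant(xs[:-1])],
            [continuant(xs[1:]), continuant(xs[1:-1])]]
```

Each factor has determinant −1, so the product has determinant $(-1)^n$, which is the identity of exercise 4.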


5. When the x's are positive, the q's of (9) are positive, and $q_{n+1} > q_{n-1}$; hence (9) is an alternating series of decreasing terms, and it converges if and only if $q_nq_{n+1} \to \infty$. By induction, if the x's are greater than ε, we have $q_n \ge (1 + \epsilon/2)^n c$, where c is chosen small enough to make this inequality valid for n = 1 and 2. But if $x_n = 1/2^n$, we have $q_n \le 2 - 1/2^n$.

6. It suffices to prove that $A_1 = B_1$; and from the fact that $0 < /\!/x_1, \ldots, x_n/\!/ < 1$ whenever $x_1, \ldots, x_n$ are positive integers, we have $B_1 = \lfloor 1/X\rfloor = A_1$.

7. Only 12...n and n...21. (The variable $x_k$ appears in exactly $F_kF_{n+1-k}$ terms; hence $x_1$ and $x_n$ can only be permuted into $x_1$ and $x_n$. If $x_1$ and $x_n$ are fixed by the permutation, it follows by induction that $x_2, \ldots, x_{n-1}$ are also fixed.)


8. This is equivalent to

$$\frac{K_{n-2}(A_{n-1}, \ldots, A_2) - X K_{n-1}(A_{n-1}, \ldots, A_1)}{K_{n-1}(A_n, \ldots, A_2) - X K_n(A_n, \ldots, A_1)} = -\frac{1}{X_n},$$

and by (6) it is equivalent to

$$X = \frac{K_{n-1}(A_2, \ldots, A_n) + X_n K_{n-2}(A_2, \ldots, A_{n-1})}{K_n(A_1, \ldots, A_n) + X_n K_{n-1}(A_1, \ldots, A_{n-1})}.$$


9. (a) By definition. (b,d) Prove this when n = 1, then apply (a) to get the result 
for general n. (c) Prove it when n = k + 1, then apply (a). 
10. If $A_0 > 0$, then $B_0 = 0$, $B_1 = A_0$, $B_2 = A_1$, $B_3 = A_2$, $B_4 = A_3$, $B_5 = A_4$, m = 5. If $A_0 = 0$, then $B_0 = A_1$, $B_1 = A_2$, $B_2 = A_3$, $B_3 = A_4$, m = 3. If $A_0 = -1$ and $A_1 = 1$, then $B_0 = -(A_2 + 2)$, $B_1 = 1$, $B_2 = A_3 - 1$, $B_3 = A_4$, m = 3. If $A_0 = -1$ and $A_1 > 1$, then $B_0 = -2$, $B_1 = 1$, $B_2 = A_1 - 2$, $B_3 = A_2$, $B_4 = A_3$, $B_5 = A_4$, m = 5. If $A_0 < -1$, then $B_0 = -1$, $B_1 = 1$, $B_2 = -A_0 - 2$, $B_3 = 1$, $B_4 = A_1 - 1$, $B_5 = A_2$, $B_6 = A_3$, $B_7 = A_4$, m = 7. [Actually, the last three cases involve eight subcases; if any of the B's is set to zero, the values should be "collapsed together" by using the rule of exercise 9(c). For example, if $A_0 = -1$ and $A_1 = A_3 = 1$, we actually have $B_0 = -(A_2 + 2)$, $B_1 = A_4 + 1$, m = 1. Double collapsing occurs when $A_0 = -2$ and $A_1 = 1$.]
11. Let $q_n = K_n(A_1, \ldots, A_n)$, $q'_n = K_n(B_1, \ldots, B_n)$, $p_n = K_{n+1}(A_0, \ldots, A_n)$, $p'_n = K_{n+1}(B_0, \ldots, B_n)$. By (5) and (11) we have $X = (p_m + p_{m-1}X_m)/(q_m + q_{m-1}X_m)$, $Y = (p'_n + p'_{n-1}Y_n)/(q'_n + q'_{n-1}Y_n)$; therefore if $X_m = Y_n$, the stated relation between X and Y holds by (8). Conversely, if $X = (qY + r)/(sY + t)$ and $|qt - rs| = 1$, we may assume that $s \ge 0$, and we can show that the partial quotients of X and Y eventually agree, by induction on s. The result is clear when s = 0, by exercise 9(d). If s > 0, let $q = as + s'$, where $0 \le s' < s$. Then $X = a + 1/((sY + t)/(s'Y + r - at))$; since $s(r - at) - ts' = sr - tq$, and $s' < s$, we know by induction and exercise 10 that the partial quotients of X and Y eventually agree. [J. de Math. Pures et Appl. 15 (1850), 153-155. The fact that m is always odd in exercise 10 shows, by a close inspection of this proof, that $X_m = Y_n$ if and only if $X = (qY + r)/(sY + t)$, where $qt - rs = (-1)^{m-n}$.]


X= 
12. (a) Since $V_nV_{n+1} = D - U_n^2$, we know that $D - U_{n+1}^2$ is a multiple of $V_{n+1}$; hence by induction $X_n = (\sqrt D - U_n)/V_n$, where $U_n$ and $V_n$ are integers. [Notes: An algorithm based on this process has many applications to the solution of quadratic equations in integers; see, for example, H. Davenport, The Higher Arithmetic (London: Hutchinson, 1952); W. J. LeVeque, Topics in Number Theory (Reading, Mass.: Addison-Wesley, 1956); and see also Section 4.5.4. By exercise 1.2.4-35, we have

$$A_{n+1} = \begin{cases} \lfloor(\lfloor\sqrt D\rfloor + U_n)/V_{n+1}\rfloor, & \text{if } V_{n+1} > 0, \\ \lfloor(\lfloor\sqrt D\rfloor + 1 + U_n)/V_{n+1}\rfloor, & \text{if } V_{n+1} < 0; \end{cases}$$

hence such an algorithm need only work with the positive integer $\lfloor\sqrt D\rfloor$. Moreover, the identity $V_{n+1} = A_n(U_{n-1} - U_n) + V_{n-1}$ makes it unnecessary to divide when $V_{n+1}$ is being determined.]

(b) Let $Y = (-\sqrt D - U)/V$, $Y_n = (-\sqrt D - U_n)/V_n$. The stated identity obviously holds by replacing $\sqrt D$ by $-\sqrt D$ in the proof of (a). We have

$$Y = (p_n/Y_n + p_{n-1})/(q_n/Y_n + q_{n-1}),$$

where $p_n$ and $q_n$ are defined in part (c) of this exercise; hence

$$Y_n = (-q_n/q_{n-1})(Y - p_n/q_n)/(Y - p_{n-1}/q_{n-1}).$$

But by (12), $p_{n-1}/q_{n-1}$ and $p_n/q_n$ are extremely close to X; since $X \ne Y$, $Y - p_n/q_n$ and $Y - p_{n-1}/q_{n-1}$ will have the same sign as Y − X for all large n. This proves that $Y_n < 0$ for all large n; hence $0 < X_n < X_n - Y_n = 2\sqrt D/V_n$; $V_n$ must be positive. Also $U_n < \sqrt D$, since $X_n > 0$. Hence $V_n < 2\sqrt D$, since $V_n \le A_nV_n < \sqrt D + U_{n-1}$.

Finally, we want to show that $U_n > 0$. Since $X_n < 1$, we have $U_n > \sqrt D - V_n$, so we need only consider the case $V_n > \sqrt D$. Then $U_n = A_nV_n - U_{n-1} \ge V_n - U_{n-1} > \sqrt D - U_{n-1}$, and this is positive as we have already observed.

Notes: In the repeating cycle, $\sqrt D + U_n = A_nV_n + (\sqrt D - U_{n-1}) > V_n$; hence $\lfloor(\sqrt D + U_{n+1})/V_{n+1}\rfloor = \lfloor A_{n+1} + V_n/(\sqrt D + U_n)\rfloor = A_{n+1} = \lfloor(\sqrt D + U_n)/V_{n+1}\rfloor$. In other words $A_{n+1}$ is determined by $U_{n+1}$ and $V_{n+1}$; we can determine $(U_n, V_n)$ from its successor $(U_{n+1}, V_{n+1})$ in the period. In fact, when $0 < V_n < \sqrt D + U_n$ and $0 < U_n < \sqrt D$, the arguments above prove that $0 < V_{n+1} < \sqrt D + U_{n+1}$ and $0 < U_{n+1} < \sqrt D$; moreover, if the pair $(U_{n+1}, V_{n+1})$ follows $(U', V')$ with $0 < V' < \sqrt D + U'$ and $0 < U' < \sqrt D$, then $U' = U_n$ and $V' = V_n$. Hence $(U_n, V_n)$ is part of the cycle if and only if $0 < V_n < \sqrt D + U_n$ and $0 < U_n < \sqrt D$.

(c) $$-\frac{V_{n+1}}{V_n} = X_nY_n = \frac{(q_nX - p_n)(q_nY - p_n)}{(q_{n-1}X - p_{n-1})(q_{n-1}Y - p_{n-1})}.$$

There is also a companion identity, namely

$$Vp_np_{n-1} + U(p_nq_{n-1} + p_{n-1}q_n) + ((U^2 - D)/V)q_nq_{n-1} = (-1)^nU_n.$$

(d) If $X_n = X_m$ for some $n \ne m$, then X is an irrational number that satisfies the quadratic equation $(q_nX - p_n)/(q_{n-1}X - p_{n-1}) = (q_mX - p_m)/(q_{m-1}X - p_{m-1})$.

The ideas underlying this exercise go back at least to Jayadeva in India, prior to A.D. 1073; see K. S. Shukla, Ganita 5 (1954), 1-20; C.-O. Selenius, Historia Math. 2 (1975), 167-184. Some of its aspects had also been discovered in Japan before 1750; see Y. Mikami, The Development of Mathematics in China and Japan (1913), 223-229. But the main principles of the theory of continued fractions for quadratics are largely due to Euler [Novi Comment. Acad. Sci. Petrop. 11 (1765), 28-66] and Lagrange [Hist. Acad. Sci. 24 (Berlin: 1768), 111-180].
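The all-integer process of part (a) is short enough to state as a program. The Python sketch below assumes the recurrences $V_{n+1} = (D - U_n^2)/V_n$, $A_{n+1} = \lfloor(\sqrt D + U_n)/V_{n+1}\rfloor$, $U_{n+1} = A_{n+1}V_{n+1} - U_n$, with the floor computed from $\lfloor\sqrt D\rfloor$ as in the note; the function name and argument order are hypothetical.

```python
from math import isqrt


def quadratic_cf(D, U, V, count):
    """First `count` partial quotients A_1, A_2, ... of X = (sqrt(D) - U)/V.

    Assumes D is a positive nonsquare, V divides D - U*U, and 0 < X < 1.
    """
    root = isqrt(D)
    assert root * root != D and (D - U * U) % V == 0
    quotients = []
    for _ in range(count):
        V_next = (D - U * U) // V
        if V_next > 0:
            A = (root + U) // V_next
        else:
            A = (root + 1 + U) // V_next
        U, V = A * V_next - U, V_next
        quotients.append(A)
    return quotients
```

For instance, D = 7, U = 2, V = 1 represents $\sqrt7 - 2$, and the pairs (U, V) return to (2, 1) after four steps, giving the period 1, 1, 1, 4.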


14. As in exercise 9, we need only verify the stated identities when c is the last partial quotient, and this verification is trivial. Now Hurwitz's rule gives $2/e = /\!/1, 2, 1, 2, 0, 1, 1, 1, 1, 1, 0, 2, 3, 2, 0, 1, 1, 3, 1, 1, 0, 2, 5, \ldots/\!/$. Taking the reciprocal, collapsing out the zeros as in exercise 9, and taking note of the pattern that appears, we find (see exercise 16) that $e/2 = 1 + /\!/2, \overline{2m+1, 3, 1, 2m+1, 1, 3}/\!/$, $m \ge 0$. [Schriften der phys.-ökon. Gesellschaft zu Königsberg 32 (1891), 59-62. Hurwitz also explained how to multiply by an arbitrary positive integer, in Vierteljahrsschrift der Naturforschenden Gesellschaft in Zürich 41 (1896), Jubelband II, 34-64, §2.]
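The pattern for e/2 can be compared with a direct expansion of a close rational approximation to e. The Python sketch below is illustrative; the helper names are hypothetical, and the 40-term Taylor sum is an assumption chosen to be far more accurate than the first twenty partial quotients require.

```python
from fractions import Fraction


def e_rational(terms=40):
    """Partial sum of 1/k! for 0 <= k < terms, as an exact fraction."""
    total, term = Fraction(0), Fraction(1)
    for k in range(terms):
        total += term
        term /= k + 1
    return total


def regular_cf(x, count):
    """First `count` partial quotients of the fraction x."""
    out = []
    for _ in range(count):
        q = x.numerator // x.denominator
        out.append(q)
        x -= q
        if x == 0:
            break
        x = 1 / x
    return out


def predicted_half_e(count):
    """1, 2, then the blocks 2m+1, 3, 1, 2m+1, 1, 3 for m = 0, 1, 2, ..."""
    out = [1, 2]
    m = 0
    while len(out) < count:
        out += [2 * m + 1, 3, 1, 2 * m + 1, 1, 3]
        m += 1
    return out[:count]
```

Comparing `regular_cf(e_rational() / 2, 20)` with `predicted_half_e(20)` exercises the first three periods of the pattern.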


15. (This procedure maintains four integers (A, B, C, D) with the invariant meaning that "our remaining job is to output the continued fraction for (Ay + B)/(Cy + D), where y is the input yet to come.") Initially set $j \leftarrow k \leftarrow 0$, $(A, B, C, D) \leftarrow (a, b, c, d)$; then input $x_j$ and set $(A, B, C, D) \leftarrow (Ax_j + B, A, Cx_j + D, C)$, $j \leftarrow j + 1$, one or more times until C + D has the same sign as C. (When $j \ge 1$ and the input has not terminated, we know that $1 < y < \infty$; and when C + D has the same sign as C we know therefore that (Ay + B)/(Cy + D) lies between (A + B)/(C + D) and A/C.) Now comes the general step: If no integer lies strictly between (A + B)/(C + D) and A/C, output $X_k \leftarrow \min(\lfloor A/C\rfloor, \lfloor(A + B)/(C + D)\rfloor)$, and set $(A, B, C, D) \leftarrow (C, D, A - X_kC, B - X_kD)$, $k \leftarrow k + 1$; otherwise input $x_j$ and set $(A, B, C, D) \leftarrow (Ax_j + B, A, Cx_j + D, C)$, $j \leftarrow j + 1$. The general step is repeated ad infinitum. However, if at any time the final $x_j$ is input, the algorithm immediately switches gears: It outputs the continued fraction for $(Ax_j + B)/(Cx_j + D)$, using Euclid's algorithm, and terminates.

The following tableau solves the requested example, where the matrix $\bigl(\begin{smallmatrix} B & A \\ D & C \end{smallmatrix}\bigr)$ begins at the upper left corner, then shifts right one on input, down one on output:

 x_j                -1     5     1     1     1     2     1     2     ∞
 X_k    39    97   -58  -193
  -2   -25   -62    37   123
   2                16    53
   3                 5    17    22
   7                 1     2     3     5
   1                       3     1     4     5    14
   1                             2     1     3     7
   1                                         2     7     9    25
  12                                         1     0     1     2
   2                                                           1
   ∞                                                           0

M. Mendès France has shown that the number of quotients output per quotient input is asymptotically bounded between 1/r and r, where $r = 2\lfloor L(|ad - bc|)/2\rfloor + 1$ and L is the function defined in exercise 38; this bound is best possible. [Topics in Number Theory, edited by P. Turán, Colloquia Math. Soc. János Bolyai 13 (1976), 183-194.]

Gosper has also shown that the algorithm above can be generalized to compute the continued fraction for (axy + bx + cy + d)/(Axy + Bx + Cy + D) from those of x and y (in particular, to compute sums and products). [MIT AI Laboratory Memo 239 (29 February 1972), Hack 101.] For further developments, see J. Vuillemin, ACM Conf. LISP and Functional Programming 5 (1988), 14-27.
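The procedure of answer 15 can be sketched as follows. This Python version is an illustration under stated assumptions: the input is a finite list of partial quotients $x_0, x_1, \ldots$ whose last element is at least 2, $ad - bc \ne 0$, and the result is finite; the function names are hypothetical. A zero value of C or C + D is treated as "an integer lies between," which forces another input.

```python
from fractions import Fraction
from math import floor


def euclid_cf(num, den):
    """Regular continued fraction of num/den by Euclid's algorithm."""
    out = []
    while den != 0:
        q = num // den
        out.append(q)
        num, den = den, num - q * den
    return out


def can_output(A, B, C, D):
    """True when C and C+D share a sign and no integer lies strictly
    between A/C and (A+B)/(C+D)."""
    if C * (C + D) <= 0:
        return False
    lo, hi = sorted((Fraction(A, C), Fraction(A + B, C + D)))
    return floor(lo) + 1 >= hi


def transform_cf(a, b, c, d, xs):
    """Partial quotients of (a*x + b)/(c*x + d), where x = xs[0] + //xs[1], ...//."""
    A, B, C, D = a, b, c, d
    out = []
    j = 0
    while True:
        if j > 0 and can_output(A, B, C, D):
            X = min(A // C, (A + B) // (C + D))        # output step
            out.append(X)
            A, B, C, D = C, D, A - X * C, B - X * D
            continue
        x = xs[j]                                       # input step
        A, B, C, D = A * x + B, A, C * x + D, C
        j += 1
        if j == len(xs):                                # final input: finish by Euclid
            return out + euclid_cf(A, C)
```

With a = 97, b = 39, c = −62, d = −25 and x = −1 + //5, 1, 1, 1, 2, 1, 2//, the successive states reproduce the columns of the tableau, and the value is −6892/4393.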
16. It is not difficult to prove by induction that $f_n(z) = z/(2n + 1) + O(z^3)$ is an odd function with a convergent power series in a neighborhood of the origin, and that it satisfies the given differential equation. Hence

$$f_0(z) = /\!/z^{-1} + f_1(z)/\!/ = \cdots = /\!/z^{-1}, 3z^{-1}, \ldots, (2n+1)z^{-1} + f_{n+1}(z)/\!/.$$

It remains to prove that $\lim_{n\to\infty} /\!/z^{-1}, 3z^{-1}, \ldots, (2n+1)z^{-1}/\!/ = f_0(z)$. [Actually Euler, age 24, obtained continued fraction expansions for the considerably more general differential equation $f'_n(z) = az^m + bf_n(z)z^{m-1} + cf_n(z)^2$; but he did not bother to prove convergence, since formal manipulation and intuition were good enough in the eighteenth century.]

There are several ways to prove the desired limiting equation. First, letting $f_n(z) = \sum_k a_{nk}z^k$, we can argue from the equation

$$(2n+1)a_{n1} + (2n+3)a_{n3}z^2 + (2n+5)a_{n5}z^4 + \cdots = 1 - (a_{n1}z + a_{n3}z^3 + a_{n5}z^5 + \cdots)^2$$

that $(-1)^ka_{n(2k+1)}$ is a sum of terms of the form $c_k/(2n+1)^{k+1}(2n+b_{k1})\ldots(2n+b_{kk})$, where the $c_k$ and $b_{km}$ are positive integers independent of n. For example, we have $-a_{n7} = 4/(2n+1)^4(2n+3)(2n+5)(2n+7) + 1/(2n+1)^4(2n+3)^2(2n+7)$. Thus $|a_{(n+1)k}| \le |a_{nk}|$, and $|f_n(z)| \le \tan|z|$ for $|z| < \pi/2$. This uniform bound on $f_n(z)$ makes the convergence proof very simple. Careful study of this argument reveals that the power series for $f_n(z)$ actually converges for $|z| < \pi\sqrt{2n+1}/2$; therefore the singularities of $f_n(z)$ get farther and farther away from the origin as n grows, and the continued fraction actually represents tanh z throughout the complex plane.
Another proof gives further information of a different kind: If we let

$$A_n(z) = n!\sum_{k=0}^{n}\binom{2n-k}{n}\frac{z^k}{k!} = \sum_{k\ge0}\frac{(n+k)!\,z^{n-k}}{k!\,(n-k)!} = z^n\,{}_2F_0(n+1, -n;\,;-1/z),$$

then

$$A_{n+1}(z) = \sum_{k\ge0}\frac{(n+k-1)!\,\bigl((4n+2)k + (n+1-k)(n-k)\bigr)}{k!\,(n+1-k)!}\,z^{n+1-k} = (4n+2)A_n(z) + z^2A_{n-1}(z).$$

It follows, by induction, that

$$K_n\Bigl(\frac1z, \frac3z, \ldots, \frac{2n-1}{z}\Bigr) = \frac{A_n(2z) + A_n(-2z)}{2^{n+1}z^n}, \qquad K_{n-1}\Bigl(\frac3z, \ldots, \frac{2n-1}{z}\Bigr) = \frac{A_n(2z) - A_n(-2z)}{2^{n+1}z^n}.$$

Hence

$$/\!/z^{-1}, 3z^{-1}, \ldots, (2n-1)z^{-1}/\!/ = \frac{A_n(2z) - A_n(-2z)}{A_n(2z) + A_n(-2z)},$$

and we want to show that this ratio approaches tanh z. By Equations 1.2.9-(11) and 1.2.6-(24),

$$e^zA_n(-z) = n!\sum_{m\ge0}\frac{z^m}{m!}\sum_{k=0}^{n}\binom{m}{k}\binom{2n-k}{n}(-1)^k = \sum_{m\ge0}\binom{2n-m}{n}\frac{z^m\,n!}{m!}.$$

Hence

$$e^zA_n(-z) - A_n(z) = R_n(z) = (-1)^nz^{2n+1}\sum_{k\ge0}\frac{(n+k)!\,z^k}{(2n+k+1)!\,k!}.$$

We now have $(e^{2z} - 1)(A_n(2z) + A_n(-2z)) - (e^{2z} + 1)(A_n(2z) - A_n(-2z)) = 2R_n(2z)$; hence

$$\tanh z - /\!/z^{-1}, 3z^{-1}, \ldots, (2n-1)z^{-1}/\!/ = \frac{2R_n(2z)}{(A_n(2z) + A_n(-2z))(e^{2z} + 1)},$$

and we have an exact formula for the difference. When $|2z| \le 1$, the factor $e^{2z} + 1$ is bounded away from zero, $|R_n(2z)| \le e\,n!/(2n+1)!$, and

$$\tfrac12|A_n(2z) + A_n(-2z)| \ge n!\Bigl(\binom{2n}{n} - \binom{2n-2}{n} - \binom{2n-4}{n} - \binom{2n-6}{n} - \cdots\Bigr) \ge \frac{(2n)!}{n!}\Bigl(1 - \frac14 - \frac1{16} - \frac1{64} - \cdots\Bigr) = \frac23\,\frac{(2n)!}{n!}.$$

Thus convergence is very rapid, even for complex values of z.
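The speed of convergence is easy to observe numerically. The Python sketch below evaluates the truncated continued fraction from the innermost partial quotient outward in ordinary floating point; the function name is hypothetical and z is assumed to be nonzero.

```python
import math


def tanh_cf(z, n):
    """Value of //1/z, 3/z, ..., (2n-1)/z// for z != 0."""
    value = 0.0
    for k in range(n, 0, -1):
        value = 1.0 / ((2 * k - 1) / z + value)
    return value
```

Already for n = 8 the result at z = 0.5 agrees with `math.tanh(0.5)` to within rounding error, in line with the bound above.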

To go from this continued fraction to the continued fraction for $e^z$, we have $\tanh z = 1 - 2/(e^{2z} + 1)$; hence we get the continued-fraction representation for $(e^{2z} + 1)/2$ by simple manipulations. Hurwitz's rule gives the expansion of $e^{2z} + 1$, from which we may subtract unity. For n odd,

$$e^{-2/n} = /\!/\,\overline{1,\ 3mn + \lfloor n/2\rfloor,\ (12m+6)n,\ (3m+2)n + \lfloor n/2\rfloor,\ 1}\,/\!/, \qquad m \ge 0.$$


Another derivation has been given by C. S. Davis, J. London Math. Soc. 20 (1945), 
194-198. The continued fraction for e was first found empirically by Roger Cotes, 
Philosophical Transactions 29 (1714), 5-45, Proposition 1, Scholium 3. Euler com- 
municated his results in a letter to Goldbach on November 25, 1731 [Correspondance 
Mathématique et Physique, edited by P. H. Fuss, 1 (St. Petersburg: 1843), 56-60], and 
he eventually published fuller descriptions in Commentarii Acad. Sci. Petropolitanæ 9 
(1737), 98-137; 11 (1739), 32-81. 

17. (b) $/\!/x_1 - 1, 1, x_2 - 2, 1, x_3 - 2, 1, \ldots, 1, x_{2n-1} - 2, 1, x_{2n} - 1/\!/$. [Note: One can remove negative parameters from continuants by using the identity

$$K_{m+n+1}(x_1, \ldots, x_m, -x, y_n, \ldots, y_1) = (-1)^{n-1}K_{m+n+2}(x_1, \ldots, x_{m-1}, x_m - 1, 1, x - 1, -y_n, \ldots, -y_1),$$

from which we obtain

$$K_{m+n+1}(x_1, \ldots, x_m, -x, y_n, \ldots, y_1) = -K_{m+n+3}(x_1, \ldots, x_{m-1}, x_m - 1, 1, x - 2, 1, y_n - 1, y_{n-1}, \ldots, y_1)$$

after a second application. A similar identity appears in exercise 41.]

(c) $1 + /\!/1, 1, 3, 1, 5, 1, \ldots/\!/ = 1 + /\!/\overline{2m+1, 1}/\!/$, $m \ge 0$.

18. Since we have $K_m(a_1, a_2, \ldots, a_m)\,/\!/a_1, a_2, \ldots, a_m, x/\!/ = K_{m-1}(a_2, \ldots, a_m) + (-1)^m/(K_{m-1}(a_1, \ldots, a_{m-1}) + K_m(a_1, a_2, \ldots, a_m)x)$ by Eqs. (5) and (8), we also have $K_m(a_1, a_2, \ldots, a_m)\,/\!/a_1, a_2, \ldots, a_m, x_1, a_1, a_2, \ldots, a_m, x_2, a_1, a_2, \ldots, a_m, x_3, a_1, \ldots/\!/ = K_{m-1}(a_2, \ldots, a_m) + /\!/(-1)^m(C + Ax_1), C + Ax_2, (-1)^m(C + Ax_3), \ldots/\!/$, where $A = K_m(a_1, a_2, \ldots, a_m)$ and $C = K_{m-1}(a_2, \ldots, a_m) + K_{m-1}(a_1, \ldots, a_{m-1})$. Consequently the stated difference is $(K_{m-1}(a_2, \ldots, a_m) - K_{m-1}(a_1, \ldots, a_{m-1}))/K_m(a_1, a_2, \ldots, a_m)$, by (6). [The case m = 2 was discussed by Euler in Commentarii Acad. Sci. Petropolitanæ 9 (1737), 98-137, §24-26.]


19. The sum for $1 \le k \le N$ is $\log_b((1 + x)(N + 1)/(N + 1 + x))$.
20. Let H = SG, g(x) = (14+ 2)G' (x), h(x = (1 oe Then (37) implies that 
h(a + 1)/(@ +2) — h(z)/(z +1) = -(1 = g(1/(1 + 2))/A+1/+2)). 

21. p(x) = c/(cex +1)? + (2—0c)/((e— lz 1)? : mae = 1/(a+c)*. When c < 1, the 
minimum of y(x)/Up(x) occurs at x = 0 ne is 2c? < 2. When c > ¢, the minimum 
occurs at x = 1 and is < $?. When c © 1.31266 the values at x = 0 and x = 1 are 
nearly equal and the minimum is > 3.2; the bounds (0.29)"y < U” < (0.31)"y are 
obtained. Still better bounds come from well-chosen linear combinations of the form 
T g(a) = dl az/(x + c3). 

23. By the interpolation formula of exercise 4.6.4-15 with zo = 0, 71 = £, £2 = T + €, 
letting € — 0, we have the general identity R} (x) = (Rn(x)— Rn (0))/£ + £R (0a (x)) 
for some 0,,(a) between 0 and x, whenever Rn is a function with continuous second 
derivative. Hence in this case R/,(x) = O(2~"). 


24. ∞. [A. Khinchin, in Compos. Math. 1 (1935), 361-382, proved that the sum $A_1 + \cdots + A_n$ of the first n partial quotients of a real number X will be asymptotically n lg n, for almost all X. Exercise 35 shows that the behavior is different for rational X.]


25. Any union of intervals can be written as a union of disjoint intervals, since we have 
Ups1 Te = Uns Ue \ Ui<j<k I;), and this is a disjoint union in which J \ Ur<jen Tj 


can be expressed as a finite union of disjoint intervals. Therefore we may take Z = (J Ix, 
where J, is an interval of length «/2* containing the kth rational number in [0.. 1], 
using some enumeration of the rationals. In this case u(Z) < e, but |Z N Pa| = n for 
all n. 


26. The continued fractions $/\!/A_1, \ldots, A_t/\!/$ that appear are precisely those for which $A_1 > 1$, $A_t > 1$, and $K_t(A_1, A_2, \ldots, A_t)$ is a divisor of n. Therefore (6) completes the proof. [Note: If $m_1/n = /\!/A_1, \ldots, A_t/\!/$ and $m_2/n = /\!/A_t, \ldots, A_1/\!/$, where $m_1$ and $m_2$ are relatively prime to n, then $m_1m_2 \equiv \pm1$ (modulo n); this rule defines the correspondence. When $A_1 = 1$ an analogous symmetry is valid, according to (46).]


27. First prove the result for n = pf, then for n = rs, where r and s are relatively 
prime. Alternatively, use the formulas in the next exercise. 


28. (a) The left-hand side is multiplicative (see exercise 1.2.4-31), and it is easily evaluated when n is a power of a prime. (c) From (a), we have Möbius's inversion formula: If $f(n) = \sum_{d\backslash n} g(d)$, then $g(n) = \sum_{d\backslash n}\mu(n/d)f(d)$.

29. We have X^; nlnn = +N? InN + O(N’) by Euler’s summation formula (see 
exercise 1.2.11.2-7). Also JL; nJ ann A(@)/d = ™_, A(d) J i<k<nN/a K and this is 
O( aan A(d).N?/d?) = O(.N?). Indeed, X g>; A(d)/d? = —¢'(2)/¢(2). 

30. The modified algorithm affects the calculation if and only if the following division step in the unmodified algorithm would have the quotient 1, and in this case it avoids the following division step. The probability that a given division step is avoided is the probability that $A_k = 1$ and that this quotient is preceded by an even number of quotients equal to 1. By the symmetry condition, this is the probability that $A_k = 1$ and is followed by an even number of quotients equal to 1. The latter happens if and only if $X_{k-1} > \phi - 1 = 0.618\ldots$, where φ is the golden ratio: For $A_k = 1$ and $A_{k+1} > 1$ if and only if $\frac23 \le X_{k-1} < 1$; $A_k = A_{k+1} = A_{k+2} = 1$ and $A_{k+3} > 1$ if and only if $\frac58 \le X_{k-1} < \frac23$; etc. Thus we save approximately $F_{k-1}(1) - F_{k-1}(\phi - 1) \approx 1 - \lg\phi \approx 0.306$ of the division steps. The average number of steps is approximately $((12\ln\phi)/\pi^2)\ln n$, when v = n and u is relatively prime to n.
K. Vahlen [Crelle 115 (1895), 221-233] considered all algorithms that replace (u, v) by (v, (±u) mod v) at each iteration when $u \bmod v \ne 0$. If $u \perp v$ there are exactly v such algorithms, and they can be represented as a binary tree with v leaves. The shallowest leaves, which correspond to the shortest possible number of iterations over all such gcd algorithms, occur when the least remainder is taken at each step; the deepest leaves occur when the greatest remainder is always chosen. [Similar ideas had been considered by Lagrange in Hist. Acad. Sci. 23 (Berlin: 1768), 111-180, §58.] For further results see N. G. de Bruijn and W. M. Zaring, Nieuw Archief voor Wiskunde (3) 1 (1953), 105-112; G. J. Rieger, Math. Nachr. 82 (1978), 157-180.

On many computers, the modified algorithm makes each division step longer; the idea of exercise 1, which saves all division steps when the quotient is unity, would be preferable in such cases.
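Two members of Vahlen's family can be compared by counting iterations. The Python sketch below is an illustration with hypothetical names: `euclid_steps` always takes u mod v, while `least_remainder_steps` takes whichever of u mod v and (−u) mod v is smaller.

```python
def euclid_steps(u, v):
    """Iterations of (u, v) <- (v, u mod v) until v = 0."""
    steps = 0
    while v:
        u, v = v, u % v
        steps += 1
    return steps


def least_remainder_steps(u, v):
    """Iterations when the smaller of u mod v and (-u) mod v is taken each time."""
    steps = 0
    while v:
        r = u % v
        if 2 * r > v:
            r = v - r
        u, v = v, r
        steps += 1
    return steps
```

For (u, v) = (8, 5) the ordinary choice needs 4 iterations and the least-remainder choice needs 3; averaging the two counts over many pairs gives an empirical view of the savings discussed in answer 30.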
31. Let $a_0 = 0$, $a_1 = 1$, $a_{n+1} = 2a_n + a_{n-1}$; then $a_n = ((1 + \sqrt2)^n - (1 - \sqrt2)^n)/2\sqrt2$, and the worst case (in the sense of Theorem F) occurs when $u = a_n + a_{n-1}$, $v = a_n$, $n \ge 2$. This result is due to A. Dupré [J. de Math. 11 (1846), 41-64], who also investigated more general "look-ahead" procedures suggested by J. Binet.


32. (b) $K_{m-1}(x_1, \ldots, x_{m-1})K_{n-1}(x_{m+2}, \ldots, x_{m+n})$ corresponds to those Morse code sequences of length m + n in which a dash occupies positions m and m + 1; the other term corresponds to the opposite case. (Alternatively, use exercise 2. The more general identity

$$K_{m+n}(x_1, \ldots, x_{m+n})K_k(x_{m+1}, \ldots, x_{m+k}) = K_{m+k}(x_1, \ldots, x_{m+k})K_n(x_{m+1}, \ldots, x_{m+n}) + (-1)^kK_{m-1}(x_1, \ldots, x_{m-1})K_{n-k-1}(x_{m+k+2}, \ldots, x_{m+n})$$

also appeared in Euler's paper. Incidentally, "Morse code" was really invented by F. C. Gerke in 1848; Morse's prototypes were quite different.)
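Euler's general identity can be spot-checked on integer arguments. The Python sketch below assumes the recurrence $K_n = x_nK_{n-1} + K_{n-2}$ with $K_0 = 1$; the function names are hypothetical, and the list `xs` holds $x_1, \ldots, x_{m+n}$ in positions 0 through m + n − 1.

```python
def continuant(xs):
    """K_n(x_1, ..., x_n); the empty continuant K_0 is 1."""
    prev, cur = 0, 1
    for x in xs:
        prev, cur = cur, x * cur + prev
    return cur


def euler_identity_holds(xs, m, n, k):
    """Check the identity for m >= 1 and 0 <= k < n."""
    lhs = continuant(xs[:m + n]) * continuant(xs[m:m + k])
    rhs = (continuant(xs[:m + k]) * continuant(xs[m:m + n])
           + (-1) ** k * continuant(xs[:m - 1]) * continuant(xs[m + k + 1:m + n]))
    return lhs == rhs
```

The case k = 0 is the splitting rule of part (b), with the product of the two shorter continuants counting the sequences that have a dash across positions m and m + 1.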

33. (a) The new representations are x = m/d, y = (n — m)/d, x£ = y' = d = 
gcd(m, n — m), for ån < m < n. (b) The relation (n/x')—y < x < n/x' defines z. 
(c) Count the x’ satisfying (b). (d) A pair of integers x > y > 0 with x L y can 
be uniquely written in the form x = Km(£1,...,£m), Y = Km-1(21,...,£m-1), 
where x; > 2 and m > 1; here y/£z = //tm,...,21//. (e) It suffices to show that 
J i<rcny2 T(k, n) = 2|n/2] + h(n); this follows from exercise 26. 

34. (a) Dividing x and y by gcd(z, y) yields g(n) = X an h(n/d); apply exercise 28(c), 
and use the symmetry between primed and unprimed variables. (b) For fixed y and t, 
the representations with rd > 2’ have z’ < vnd; hence there are O(vnd/y) such 
representations. Now sum for 0 < t < y < \/n/d. (c) If s(y) is the given sum, then 
Zay 8(4) = y(Hay — Hy) = k(y), say; hence s(y) = diay u(@)K(y/d). Now k(y) = 
yln2 — 1/4 + O(1/y). (A) DX, p0) = OM, Lay Md) /vd = Degen (a) /ed?. 
(Similarly, 77, a-1(y)/y” = O(1).) (©) Eg wk) /k? = 6/r? + O(1/n) (see exercise 
4.5.2-10(d)); and Xg; u(k) log k/k? = O(1). Hence ha(n) = n((3ln 2)/7?) In(n/d) + 
O(n) for d > 1. Finally h(n) = 2) ayn M(d)he(n/ed) = ((6In 2) /m?)n(Inn—S> — 0’) + 
O(no-1(n)*), where the remaining sums are X = Jean Hd) In(ed)/ed = 0 and w= 
Sean Uld) Ine/ed = X an A(d)/d. [It is well known that o_1(n) = O(log logn); see 
Hardy and Wright, An Introduction to the Theory of Numbers, §22.9.] 

35. See Proc. Nat. Acad. Sci. 72 (1975), 4720-4722. M. L. V. Pitteway and C. M. A. 
Castle [Bull. Inst. Math. and Its Applications 24 (1988), 17-20] have found strong and 
tantalizing empirical evidence that the sum of all partial quotients is actually 


n? 1 18m2? \ 6 4r p+1p™—-1 2 
24(In 2)? ( 2 n? ) -D (2 gr pol ) (In p) 


p prime 
p"\n 


— 2.542875 + O(n~*/?), 


36. Working the algorithm backwards, assuming that $t_k - 1$ divisions occur in step C2
for a given value of $k$, we obtain minimum $u_n$ when $\gcd(u_{k+1},\ldots,u_n) = F_{t_1}\ldots F_{t_k}$
and $u_k \equiv F_{t_1}\ldots F_{t_{k-1}}F_{t_k-1}$ (modulo $\gcd(u_{k+1},\ldots,u_n)$); here the $t$'s are $\ge 2$, $t_1 \ge 3$,
and $t_1+\cdots+t_{n-1} = N+n-1$. One way to minimize $u_n = F_{t_1}\ldots F_{t_{n-1}}$ under these
conditions is to take $t_1 = 3$, $t_2 = \cdots = t_{n-2} = 2$, $u_n = 2F_{N-n+2}$. If we stipulate also
that $u_1 \ge u_2 \ge \cdots \ge u_n$, the solution $u_1 = 2F_{N-n+3}+1$, $u_2 = \cdots = u_{n-1} = 2F_{N-n+3}$,
$u_n = 2F_{N-n+2}$ has minimum $u_1$. [See CACM 13 (1970), 433-436, 447-448.]


37. See Proc. Amer. Math. Soc. 7 (1956), 1014-1021; see also exercise 6.1-18. 


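38. Let $m = \lceil n/\phi\rceil$, so that $m/n = \phi^{-1} + \epsilon = //a_1,a_2,\ldots//$ where $0 < \epsilon < 1/n$. Let $k$
be minimal such that $a_k \ge 2$; then $(\phi^{1-k} + (-1)^kF_{k-1}\epsilon)/(\phi^{-k} - (-1)^kF_k\epsilon) \ge 2$, hence
$k$ is even and $\phi^{-2} = 2-\phi \le \phi^kF_{k+2}\epsilon = (\phi^{2k+2} - \phi^{-2})\epsilon/\sqrt5$. [Ann. Polon. Math. 1
(1954), 203-206.]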


39. At least 287 at bats; $//2,1,95// = 96/287 \approx .33449477$, and no fraction with
denominator $< 287$ lies in the interval

$$[.3335\,..\,.3345] = [//2,1,666//\;..\;//2,1,94,1,1,3//].$$

To solve the general question of the fraction in $[a\,..\,b]$ with smallest denominator,
where $0 < a < b < 1$, note that in terms of regular continued-fraction representations
we have $//x_1,x_2,\ldots// < //y_1,y_2,\ldots//$ if and only if $(-1)^jx_j < (-1)^jy_j$ for the smallest
$j$ with $x_j \ne y_j$, where we place "$\infty$" after the last partial quotient of a rational
number. Thus if $a = //x_1,x_2,\ldots//$ and $b = //y_1,y_2,\ldots//$, and if $j$ is minimal with
$x_j \ne y_j$, the fractions in $[a\,..\,b]$ have the form $c = //x_1,\ldots,x_{j-1},z_j,\ldots,z_m//$ where
$//z_j,\ldots,z_m//$ lies between $//x_j,x_{j+1},\ldots//$ and $//y_j,y_{j+1},\ldots//$ inclusive. Let $K_{-1} = 0$.
The denominator

$$K_{j-1}(x_1,\ldots,x_{j-1})\,K_{m-j+1}(z_j,\ldots,z_m) + K_{j-2}(x_1,\ldots,x_{j-2})\,K_{m-j}(z_{j+1},\ldots,z_m)$$

of $c$ is minimized when $m = j$ and $z_j = (j \text{ odd} \Rightarrow y_j + [y_{j+1} \ne \infty];\; x_j + [x_{j+1} \ne \infty])$.
[Another way to derive this method comes from the theory in the following exercise.] 
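
The smallest-denominator search can be sketched recursively. The following is a minimal Python illustration, not the book's procedure verbatim: it peels off one partial quotient at a time, which amounts to the same comparison of continued-fraction prefixes described above. The function name `simplest_between` is hypothetical.

```python
from fractions import Fraction
from math import floor

def simplest_between(a, b):
    """Fraction with the smallest denominator in the closed interval [a, b].

    Assumes 0 <= a <= b are Fractions.
    """
    fa = floor(a)
    if fa == a:                      # a is an integer: denominator 1
        return Fraction(fa)
    if fa < floor(b):                # an integer lies in (a, b]
        return Fraction(fa + 1)
    # a and b share the integer part fa; recurse on the reciprocals
    # of the fractional parts (the order of the endpoints flips).
    rest = simplest_between(1 / (b - fa), 1 / (a - fa))
    return fa + 1 / rest

# Batting average that rounds to .334: the interval [.3335, .3345].
print(simplest_between(Fraction(3335, 10000), Fraction(3345, 10000)))  # 96/287
```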


40. One can prove by induction that $p_rq_l - p_lq_r = 1$ at each node, hence $p_l$ and $q_l$
are relatively prime. Since $p/q < p'/q'$ implies that $p/q < (p + p')/(q + q') < p'/q'$, it
is also clear that the labels on all left descendants of $p/q$ are less than $p/q$, while the
labels on all its right descendants are greater. Therefore each rational number occurs
at most once as a label.

It remains to show that each rational does appear. If $p/q = //a_1,\ldots,a_r,1//$, where
each $a_i$ is a positive integer, one can show by induction that the node labeled $p/q$ is
found by going left $a_1$ times, then right $a_2$ times, then left $a_3$ times, etc.
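
The left/right description can be checked with a short walk down the tree. This is a hedged sketch assuming the usual construction in which each node is the mediant of its nearest left and right ancestors, starting from the sentinels 0/1 and 1/0; the name `stern_brocot_path` is hypothetical.

```python
def stern_brocot_path(p, q):
    """Return the moves ('L'/'R') from the root 1/1 to the node labeled p/q."""
    lo = (0, 1)          # nearest left ancestor (sentinel 0/1)
    hi = (1, 0)          # nearest right ancestor (sentinel 1/0)
    path = []
    while True:
        mp, mq = lo[0] + hi[0], lo[1] + hi[1]   # mediant = current node
        if p * mq == q * mp:
            return "".join(path)
        if p * mq < q * mp:                      # p/q is smaller: go left
            path.append("L")
            hi = (mp, mq)
        else:                                    # p/q is larger: go right
            path.append("R")
            lo = (mp, mq)

# 3/8 = //2,1,1,1//, so we expect left 2 times, right once, left once.
print(stern_brocot_path(3, 8))   # LLRL
```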

[The sequence of labels on successive levels of this tree was first studied by M. A. 
Stern, Crelle 55 (1858), 193-220, although the relation to binary trees is not explicit in 
his paper. The notion of obtaining all possible fractions by successively interpolating 
(p + p')/(q + q’) between adjacent elements p/q and p’/q' goes back much further: 
The essential ideas were published by Daniel Schwenter [Deliciae Physico-Mathematicae
(Nürnberg: 1636), Part 1, Problem 87; Geometria Practica, 3rd edition (1641), 68;
see M. Cantor, Geschichte der Math. 2 (1900), 763-765], and by John Wallis in his 
Treatise of Algebra (1685), Chapters 10-11. C. Huygens put such ideas to good use 
when designing the gear-wheels of his planetarium [see Descriptio Automati Planetarii 
(1703), published after his death]. Lagrange gave a full description in Hist. Acad. Sci. 
23 (Berlin: 1767), 311-352, §24, and in his additions to the French translation of Euler’s 
algebra (1774), §18-§20. See also exercise 1.3.2-19; A. Brocot, Revue Chronométrique 
3 (1861), 186-194; D. H. Lehmer, AMM 36 (1929), 59-67] 


41. In fact, the regular continued fractions for numbers of the general form

$$\frac{1}{l_1} + \frac{(-1)^{e_1}}{l_1^2l_2} + \frac{(-1)^{e_2}}{l_1^4l_2^2l_3} + \cdots$$


have an interesting pattern, based on the continuant identity

$$K_{m+n+1}(x_1,\ldots,x_{m-1},x_m-1,1,y_n-1,y_{n-1},\ldots,y_1) =
x_mK_{m-1}(x_1,\ldots,x_{m-1})\,K_n(y_n,\ldots,y_1)
+ (-1)^nK_{m+n}(x_1,\ldots,x_{m-1},0,-y_n,-y_{n-1},\ldots,-y_1).$$

This identity is most interesting when $y_n = x_{m-1}$, $y_{n-1} = x_{m-2}$, etc., since

$$K_{n+1}(z_1,\ldots,z_k,0,z_{k+1},\ldots,z_n) = K_{n-1}(z_1,\ldots,z_{k-1},z_k+z_{k+1},z_{k+2},\ldots,z_n).$$

In particular we find that if $p_n/q_n = K_{n-1}(x_2,\ldots,x_n)/K_n(x_1,\ldots,x_n) = //x_1,\ldots,x_n//$,
then $p_n/q_n + (-1)^n/q_n^2r = //x_1,\ldots,x_n,r-1,1,x_n-1,x_{n-1},\ldots,x_1//$. By changing
$//x_1,\ldots,x_n//$ to $//x_1,\ldots,x_{n-1},x_n-1,1//$, we can control the sign $(-1)^n$ as desired.

For example, the partial sums of the first series have the following continued fractions
of even length: $//1,1//$; $//1,1,1,1,0,1// = //1,1,1,2//$; $//1,1,1,2,1,1,1,1,1,1//$;
$//1,1,1,2,1,1,1,1,1,1,1,1,0,1,1,1,1,1,2,1,1,1// = //1,1,1,2,1,1,1,1,1,1,1,2,1,1,1,
1,2,1,1,1//$; and from this point on the sequence settles down and obeys a simple
reflecting pattern. We find that the $n$th partial quotient $a_n$ can be computed rapidly
as follows, if $n - 1 = 20q + r$ where $0 \le r < 20$:

$$a_n = \begin{cases}
1, & \text{if } r = 0, 2, 4, 5, 6, 7, 9, 10, 12, 13, 14, 15, 17, \text{ or } 19;\\
2, & \text{if } r = 3 \text{ or } 16;\\
1 + (q+r) \bmod 2, & \text{if } r = 8 \text{ or } 11;\\
2 - d_q, & \text{if } r = 1;\\
1 + d_{q+1}, & \text{if } r = 18.
\end{cases}$$

Here $d_n$ is the "dragon sequence" defined by the rules $d_0 = 1$, $d_{2n} = d_n$, $d_{4n+1} = 0$,
$d_{4n+3} = 1$; the Jacobi symbol $\bigl(\frac{-1}{n}\bigr)$ is $1 - 2d_n$. The dragon curve discussed in exercise
4.1-18 turns right at its $n$th step if and only if $d_n = 1$.
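
The rule for $a_n$ can be compared against an exact computation. The sketch below is illustrative only: it assumes that the "first series" is $\sum_{k\ge1}2^{-(2^k-1)}$, an assumption that matches the partial sums $1/2 = //1,1//$ and $5/8 = //1,1,1,2//$ listed above, and all function names are hypothetical.

```python
from fractions import Fraction

def dragon(n):
    """Dragon sequence: d_0 = 1, d_{2n} = d_n, d_{4n+1} = 0, d_{4n+3} = 1."""
    if n == 0:
        return 1
    while n % 2 == 0:
        n //= 2
    return 1 if n % 4 == 3 else 0

def rule_quotient(n):
    """n-th partial quotient predicted by the case analysis, n >= 1."""
    q, r = divmod(n - 1, 20)
    if r in (3, 16):
        return 2
    if r in (8, 11):
        return 1 + (q + r) % 2
    if r == 1:
        return 2 - dragon(q)
    if r == 18:
        return 1 + dragon(q + 1)
    return 1

def partial_quotients(x):
    """Partial quotients of //a_1, a_2, ...// for a Fraction 0 < x < 1."""
    p, q = x.numerator, x.denominator
    out = []
    while p:
        a, r = divmod(q, p)
        out.append(a)
        q, p = p, r
    return out

def partial_sum(terms):
    """Sum of 2^-(2^k - 1) for k = 1..terms (the assumed series)."""
    return sum(Fraction(1, 2 ** (2 ** k - 1)) for k in range(1, terms + 1))

# Compare the first 80 quotients of a 7-term partial sum with the rule.
actual = partial_quotients(partial_sum(7))[:80]
predicted = [rule_quotient(n) for n in range(1, 81)]
print(actual == predicted)
```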

Liouville's numbers with $l \ge 3$ are equal to $//l-1,\ l+1,\ l^2-1,\ 1,\ l,\ l-1,\ l^{12}-1,\ 1,$
$l-2,\ l,\ 1,\ l^2-1,\ l+1,\ l-1,\ l^{72}-1,\ \ldots//$. The $n$th partial quotient $a_n$ depends on the
dragon sequence and on $n \bmod 4$ as follows: If $n \bmod 4 = 1$ it is $l-2+d_{n-1}+(\lfloor n/2\rfloor \bmod 4)$
and if $n \bmod 4 = 2$ it is $l+2-d_{n+2}-(\lfloor n/2\rfloor \bmod 4)$; if $n \bmod 4 = 0$ it is 1 or $l^{k!(k-1)}-1$,
depending on whether $d_n = 0$ or 1, where $2^k$ is the largest power of 2 dividing $n$;
and if $n \bmod 4 = 3$ it is $l^{k!(k-1)}-1$ or 1, depending on whether $d_{n+1} = 0$ or 1, where $2^k$
is the largest power of 2 dividing $n + 1$. When $l = 2$ the same rules apply, except that
0s must be removed, so there is a more complicated pattern depending on $n \bmod 24$.
[References: J. O. Shallit, J. Number Theory 11 (1979), 209-217; Allouche, Lubiw,
Mendès France, van der Poorten, and Shallit, Acta Arithmetica 77 (1996), 77-96.]


42. Suppose that $\|qX\| = |qX - p|$. We can always find integers $u$ and $v$ such
that $q = uq_{n-1} + vq_n$ and $p = up_{n-1} + vp_n$, where $p_n = K_{n-1}(A_2,\ldots,A_n)$, since
$q_np_{n-1} - q_{n-1}p_n = \pm1$. The result is clear if $v = 0$. Otherwise we must have $uv < 0$,
hence $u(q_{n-1}X - p_{n-1})$ has the same sign as $v(q_nX - p_n)$, and $|qX - p|$ is equal
to $|u|\,|q_{n-1}X - p_{n-1}| + |v|\,|q_nX - p_n|$. This completes the proof, since $u \ne 0$. See
Theorem 6.4S for a generalization.

43. If x is representable, so is the parent of x in the Stern—Brocot tree of exercise 40; 
thus the representable numbers form a subtree of that binary tree. Let (u/u’) and 
(v/v') be adjacent representable numbers. Then one is an ancestor of the other; say 
(u/u’) is an ancestor of (v/v’), since the other case is similar. Then (u/u’) is the nearest 
left ancestor of (v/v’), so all numbers between u/u’ and v/v’ are left descendants of 
$(v/v')$ and the mediant $((u + v)/(u' + v'))$ is its left child. According to the relation
between regular continued fractions and the binary tree, the mediant and all of its left
descendants will have $(u/u')$ as their last representable $p_i/q_i$, while all of the mediant's
right descendants will have $(v/v')$ as one of the $p_i/q_i$. (The numbers $p_i/q_i$ label the
parents of the "turning-point" nodes on the path to $x$.)


44. A counterexample for M = N = 100 is (u/u’) = 4, (v/v') = &. However, the 
identity is almost always true, because of (12); it fails only when u/u’ + v/v’ is very 


nearly equal to a fraction that is simpler than (u/u’). 


45. To determine $A$ and $r$ such that $u = Av + r$, $0 \le r < v$, using ordinary long division,
takes $O((1 + \log A)(\log u))$ units of time. If the quotients during the algorithm are $A_1$,
$A_2$, ..., $A_m$, then $A_1A_2\ldots A_m \le u$, so $\log A_1 + \cdots + \log A_m \le \log u$. Also $m = O(\log u)$
by Corollary L.
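
The key inequality $A_1A_2\ldots A_m \le u$ is easy to observe numerically. A minimal sketch (the helper name `euclid_quotients` is hypothetical):

```python
from math import prod

def euclid_quotients(u, v):
    """Quotients A_1, ..., A_m produced by Euclid's algorithm on (u, v)."""
    quotients = []
    while v:
        a, r = divmod(u, v)      # u = a*v + r, 0 <= r < v
        quotients.append(a)
        u, v = v, r
    return quotients

for u, v in [(8616460799, 11111), (1000000007, 998244353), (987, 610)]:
    qs = euclid_quotients(u, v)
    # The product of the quotients never exceeds u.
    print(u, v, len(qs), prod(qs) <= u)
```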

46. Yes, to $O(n(\log n)^2(\log\log n))$, even if we also need to compute the sequence of
partial quotients that would be computed by Euclid's algorithm; see A. Schönhage, Acta
Informatica 1 (1971), 139-144. Moreover, Schönhage's algorithm is asymptotically
optimal for computing a continued fraction expansion, with respect to the multiplications
and divisions it performs [V. Strassen, SICOMP 12 (1983), 1-27]. Algorithm 4.5.2L is
better in practice unless $n$ is quite large, but an efficient implementation for numbers
exceeding about 1800 bits is sketched in the book Fast Algorithms by A. Schönhage,
A. F. W. Grotefeld, and E. Vetter (Heidelberg: Spektrum Akademischer Verlag, 1994),
§7.2.


48. $T_j = (K_{j-2}(-a_2,\ldots,-a_{j-1}),\ K_{j-1}(-a_1,\ldots,-a_{j-1}),\ K_{n-j}(a_{j+1},\ldots,a_n)d) =$
$((-1)^jK_{j-2}(a_2,\ldots,a_{j-1}),\ (-1)^{j-1}K_{j-1}(a_1,\ldots,a_{j-1}),\ K_{n-j}(a_{j+1},\ldots,a_n)d)$.

49. Since Azı + p21 = pu and A£n+1 + HZn+1 = —Av/d, there is an odd value 
of j such that Ax; + uz; > 0 and Azj+2 + uzj+2 < 0. If Av; + pz; > 0 and 


A£j+2 + UZj+2 < —0 we have u > O/z; and A > —0/£j+2. It follows that 0 < 
ÀATj+1 T HZj+1 < Apxj4+125/0 = Apezj4125 42/0 < 2Auv/0 = 20, because we have 
|£k+12k| = Kk-1(a2,...,ak)Kn-k(äk+1;---,an) < Kn-1(a2,...,an) = v/d for all k. 
[H. W. Lenstra, Jr., Math. Comp. 42 (1984), 331-340.] 
50. Let k = [8/a]. If ka < y, the answer is k; otherwise it is 

koii ee mod ne Uae), 


Q 


51. If $ax - mz = y$ and $x \perp y$ we have $x \perp mz$. Consider the Stern-Brocot tree of
exercise 40, with an additional node labeled 0/1. Attach the tag value $y = ax - mz$
together with each node label $z/x$. We want to find all nodes $z/x$ whose tag $y$ is at most
$\theta = \sqrt{m/2}$ in absolute value and whose denominator $x$ is also $\le \theta$. The only possible
path to such nodes keeps a positive tag to the left and a negative tag to the right. This
rule defines a unique path, which moves to the right when the tag is positive and to the
left when the tag is negative, stopping when the tag becomes zero. The same path is
followed implicitly when Algorithm 4.5.2X is performed with $u = m$ and $v = a$, except
that the algorithm skips ahead: it visits only nodes of the path just before the tag
changes sign (the parents of the "turning point" nodes as in exercise 43).

Let $z/x$ be the first node of the path whose tag $y$ satisfies $|y| \le \theta$. If $x > \theta$, there
is no solution, since subsequent values on the path have even larger denominators.
Otherwise $(\pm x, \pm y)$ is a solution, provided that $x \perp y$.

It is easy to see that there is no solution if $y = 0$, and that if $y \ne 0$ the tag
on the next node of the path will not have the same sign as $y$. Therefore node $z/x$
will be visited by Algorithm 4.5.2X, and we will have $x = x_j = K_{j-1}(a_1,\ldots,a_{j-1})$,
$y = y_j = (-1)^{j-1}K_{n-j}(a_{j+1},\ldots,a_n)d$, $z = z_j = K_{j-2}(a_2,\ldots,a_{j-1})$ for some $j$
(see exercise 48). The next possibility for a solution will be the node labeled $z'/x' =
(z_{j-1}+kz_j)/(x_{j-1}+kx_j)$ with tag $y' = y_{j-1} + ky_j$, where $k$ is as small as possible such
that $|y'| \le \theta$; we have $y'y < 0$. However, $x'$ must now exceed $\theta$; otherwise we would
have $m = K_n(a_1,\ldots,a_n)d = x'|y| + x|y'| \le \theta^2 + \theta^2 = m$, and equality cannot hold.

This discussion proves that the problem can be solved efficiently by applying
Algorithm 4.5.2X with $u = m$ and $v = a$, but with the following replacement for
step X2: "If $v_3 \le \sqrt{m/2}$, the algorithm terminates. The pair $(x, y) = (|v_2|, v_3\,\mathrm{sign}(v_2))$
is then the unique solution, provided that $x \perp y$ and $x \le \sqrt{m/2}$; otherwise there is no
solution." [P. S. Wang, Lecture Notes in Comp. Sci. 162 (1983), 225-235; P. Kornerup
and R. T. Gregory, BIT 23 (1983), 9-20.]
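
The modified termination rule can be sketched as follows. This is an illustrative Python version, not Algorithm 4.5.2X itself: it keeps only the $(u_2, u_3)$ and $(v_2, v_3)$ components, assumes $0 < a < m$, and tests $v_3 \le \sqrt{m/2}$ in the integer form $2v_3^2 \le m$. The name `rational_reconstruct` is hypothetical.

```python
from math import gcd

def rational_reconstruct(a, m):
    """Find (x, y) with a*x = y (mod m), 0 < x <= sqrt(m/2), |y| <= sqrt(m/2),
    gcd(x, y) = 1; return None if no such pair exists. Assumes 0 < a < m."""
    u2, u3 = 0, m            # invariant: u2*a = u3 (mod m)
    v2, v3 = 1, a            # invariant: v2*a = v3 (mod m)
    while 2 * v3 * v3 > m:   # i.e. v3 > sqrt(m/2)
        q = u3 // v3
        u2, u3, v2, v3 = v2, v3, u2 - q * v2, u3 - q * v3
    x = abs(v2)
    y = v3 if v2 > 0 else -v3
    if 2 * x * x <= m and gcd(x, abs(y)) == 1:
        return x, y
    return None

m = 10007
# 6673 is 5/3 modulo 10007, and 3334 is -5/3 modulo 10007.
print(rational_reconstruct(6673, m))   # (3, 5)
print(rational_reconstruct(3334, m))   # (3, -5)
```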

A similar method works if we require $0 < x \le \theta_1$ and $|y| \le \theta_2$, whenever $2\theta_1\theta_2 \le m$.


SECTION 4.5.4 
1. If $d_k$ isn't prime, its prime factors are cast out before $d_k$ is tried.
2. No; the algorithm would fail if $p_{t-1} = p_t$, giving "1" as a spurious prime factor.
3. Let $P$ be the product of the first 168 primes. [Note: Although $P = 19590\ldots5910$
is a 416-digit number, such a gcd can be computed in much less time than it would
take to do 168 divisions, if we just want to test whether or not $n$ is prime.]


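4. In the notation of exercise 3.1-11,

$$\sum_{\mu\ge0,\,\lambda\ge1} 2^{\lceil\lg\max(\mu+1,\lambda)\rceil}P(\mu,\lambda) =
\frac1m\sum_{l\ge1} f(l)\prod_{k=1}^{l-1}\Bigl(1-\frac{k}{m}\Bigr),$$

where $f(l) = \sum_{1\le\lambda\le l} 2^{\lceil\lg\max(l+1-\lambda,\lambda)\rceil}$. If $l = 2^{k+\theta}$ where $0 < \theta \le 1$, we have

$$f(l) = l^2(3\cdot2^{-\theta} - 2\cdot2^{-2\theta}),$$

where the function $3\cdot2^{-\theta} - 2\cdot2^{-2\theta}$ reaches a maximum of $\frac98$ at $\theta = \lg(4/3)$ and has
a minimum of 1 at $\theta = 0$ and 1. Therefore the average value of $2^{\lceil\lg\max(\mu+1,\lambda)\rceil}$ lies
between 1.0 and 1.125 times the average value of $\mu + \lambda$, and the result follows.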

Notes: Richard Brent has observed that, as $m \to \infty$, the density $\prod_{k=1}^{l-1}(1-k/m) =
\exp(-l(l-1)/2m + O(l^3/m^2))$ approaches a normal distribution, and we may assume
that $\theta$ is uniformly distributed. Then $3\cdot2^{-\theta} - 2\cdot2^{-2\theta}$ takes the average value $3/(4\ln2)$,
and the average number of iterations needed by Algorithm B comes to approximately
$(3/(4\ln2) + \frac12)\sqrt{\pi m/2} \approx 1.98277\sqrt m$. A similar analysis of the more general method
in the answer to exercise 3.1-7 gives $\approx 1.92600\sqrt m$, when $p \approx 2.4771366$ is chosen
"optimally" as the root of $(p^2 - 1)\ln p = p^2 - p + 1$. See BIT 20 (1980), 176-184.

Algorithm B is a refinement of Pollard's original algorithm, which was based on
exercise 3.1-6(b) instead of the yet undiscovered result in exercise 3.1-7. He showed
that the least $n$ such that $X_{2n} = X_n$ has average value $\approx (\pi^2/12)Q(m) \approx 1.0308\sqrt m$;
this constant $\pi^2/12$ is explained by Eq. 4.5.3-(21). Hence the average amount of work
needed by his original algorithm is about $1.03081\sqrt m$ gcds (or multiplications mod $m$)
and $3.09243\sqrt m$ squarings. This will actually be better than Algorithm B when the
cost of gcd is more than about 1.17 times the cost of squaring, as it usually is with
large numbers.

Brent noticed, however, that Algorithm B can be improved by not checking the gcd
when $k > l/2$; if step B4 is repeated until $k \le l/2$, we will still detect the cycle, after
$\lambda\lfloor\ell(\mu)/\lambda\rfloor = \ell(\mu) - (\ell(\mu) \bmod \lambda)$ further iterations. The average cost now becomes
approximately $(3/(4\ln2))\sqrt{\pi m/2} \approx 1.35611\sqrt m$ iterations when we square without
taking the gcd, plus $((\ln\pi - \gamma)/(4\ln2) + \frac12)\sqrt{\pi m/2} \approx .88319\sqrt m$ iterations when we
do both. [See the analysis by Henri Cohen in A Course in Computational Algebraic
Number Theory (Berlin: Springer, 1993), §8.5.]


5. Remarkably, $11111 \equiv 8616460799$ (modulo $3\cdot7\cdot8\cdot11$), so (14) is correct also for
$N = 11111$ except with respect to the modulus 5. Since the residues $(x^2 - N) \bmod 5$
are 4, 0, 3, 3, 0, we must have $x \bmod 5 = 0$, 1, or 4. The first $x \ge \lceil\sqrt N\rceil = 106$
that satisfies all the conditions is $x = 144$; but the square root of $144^2 - 11111 = 9625$
is not an integer. The next case, however, gives $156^2 - 11111 = 13225 = 115^2$, and
$11111 = (156 - 115)\cdot(156 + 115) = 41\cdot271$.
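
The final arithmetic can be confirmed with a plain difference-of-squares search. This sketch deliberately omits the sieve moduli discussed in the answer and simply tries every $x \ge \lceil\sqrt N\rceil$; the name `fermat_factor` is hypothetical, and $N$ is assumed to be odd.

```python
from math import isqrt

def fermat_factor(N):
    """Return (x - y, x + y) for the smallest x >= ceil(sqrt(N)) with x*x - N = y*y."""
    x = isqrt(N)
    if x * x < N:
        x += 1                       # x = ceil(sqrt(N))
    while True:
        d = x * x - N
        y = isqrt(d)
        if y * y == d:               # x^2 - N is a perfect square
            return x - y, x + y
        x += 1

print(fermat_factor(11111))          # (41, 271), found at x = 156, y = 115
```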


6. Let us count the number of solutions $(x, y)$ of the congruence $N \equiv (x - y)(x + y)$
(modulo $p$), where $0 \le x, y < p$. Since $N \not\equiv 0$ and $p$ is prime, $x + y \not\equiv 0$. For each
$v \not\equiv 0$ there is a unique $u$ (modulo $p$) such that $N \equiv uv$. The congruences $x - y \equiv u$,
$x + y \equiv v$ now uniquely determine $x \bmod p$ and $y \bmod p$, since $p$ is odd. Thus the stated
congruence has exactly $p - 1$ solutions $(x, y)$. If $(x, y)$ is a solution, so is $(x, p - y)$ if
$y \ne 0$, since $(p-y)^2 \equiv y^2$; and if $(x, y_1)$ and $(x, y_2)$ are solutions with $y_1 \ne y_2$, we have
$y_1^2 \equiv y_2^2$; hence $y_1 = p - y_2$. Thus the number of different $x$ values among the solutions
$(x, y)$ is $(p - 1)/2$ if $N \equiv x^2$ has no solutions, or $(p + 1)/2$ if $N \equiv x^2$ has solutions.
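
A brute-force count illustrates both conclusions for $N = 11111$ and a few small odd primes. The helper names are hypothetical.

```python
def solutions(N, p):
    """All (x, y) with 0 <= x, y < p and N = (x - y)(x + y) (mod p)."""
    return [(x, y) for x in range(p) for y in range(p)
            if (x * x - y * y - N) % p == 0]

def x_values(N, p):
    """Distinct x values that occur among the solutions."""
    return sorted({x for x, _ in solutions(N, p)})

N = 11111
for p in (3, 5, 7, 11):
    # Expect p - 1 solutions, and (p - 1)/2 or (p + 1)/2 distinct x values.
    print(p, len(solutions(N, p)), x_values(N, p))
```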


7. One procedure is to keep two indices for each modulus, one for the current word 
position and one for the current bit position; loading two words of the table and doing 
an indexed shift command will bring the table entries into proper alignment. (Many 
computers have special facilities for such bit manipulation.) 


8. (We may assume that $N = 2M$ is even.) The following algorithm uses an auxiliary
table $X[1], X[2], \ldots, X[M-1]$, where $X[k]$ represents the primality of $2k + 1$.

S1. Set $X[k] \leftarrow 1$ for $1 \le k < M$. Also set $j \leftarrow 1$, $k \leftarrow 1$, $p \leftarrow 3$, $q \leftarrow 4$. (During
this algorithm $p = 2j + 1$ and $q = 2j + 2j^2$.)

S2. If $X[j] = 0$, go to S4. Otherwise output $p$, which is prime, and set $k \leftarrow q$.

S3. If $k < M$, then set $X[k] \leftarrow 0$, $k \leftarrow k + p$, and repeat this step.

S4. Set $j \leftarrow j + 1$, $p \leftarrow p + 2$, $q \leftarrow q + 2p - 2$. If $j < M$, return to S2. ∎

A major part of this calculation could be made noticeably faster if $q$ (instead of $j$) were
tested against $M$ in step S4, and if a new loop were appended that outputs $2j + 1$ for
all remaining $X[j]$ that equal 1, suppressing the manipulation of $p$ and $q$.
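
Steps S1 to S4 translate almost line for line into Python. The sketch below is one possible transcription, with the hypothetical name `odd_primes_below`; it tests $j < M$ at the top of the loop rather than at the end of S4, which differs from the steps only when $M = 1$.

```python
def odd_primes_below(N):
    """Odd primes less than the even number N = 2M, following steps S1-S4."""
    M = N // 2
    X = [1] * M                 # S1: X[k] = 1 means 2k+1 is still a candidate
    primes = []
    j, p, q = 1, 3, 4           # invariants: p = 2j + 1, q = 2j + 2j^2
    while j < M:
        if X[j]:                # S2: p is prime
            primes.append(p)
            k = q               # index of p*p
            while k < M:        # S3: cross out p*p, p*(p+2), ...
                X[k] = 0
                k += p
        j += 1                  # S4 (q uses the already updated p)
        p += 2
        q += 2 * p - 2
    return primes

print(odd_primes_below(50))     # [3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]
```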
Notes: The original sieve of Eratosthenes was described in Book 1, Chapter 13 of
Nicomachus's Introduction to Arithmetic. It is well known that $\sum_{p\text{ prime}}[p \le N]/p =
\ln\ln N + M + O((\log N)^{-10000})$, where $M = \gamma + \sum_{k\ge2}\mu(k)\ln\zeta(k)/k$ is Mertens's
constant 0.26149 72128 47642 78375 54268 38608 69585 90516-; see F. Mertens, Crelle
76 (1874), 46-62; Greene and Knuth, Mathematics for the Analysis of Algorithms
(Boston: Birkhäuser, 1981), §4.2.3. In particular, the number of operations in the
original algorithm described by Nicomachus is $N\ln\ln N + O(N)$. Improvements in the
efficiency of sieve methods for generating primes are discussed in exercise 5.2.3-15 and
in Section 7.1.3.


9. If $p^2$ is a divisor of $n$ for some prime $p$, then $p$ is a divisor of $\lambda(n)$, but not of $n - 1$.
If $n = p_1p_2$, where $p_1 < p_2$ are primes, then $p_2 - 1$ is a divisor of $\lambda(n)$ and therefore
$p_1p_2 - 1 \equiv 0$ (modulo $p_2 - 1$). Since $p_2 \equiv 1$, this means $p_1 - 1$ is a multiple of $p_2 - 1$,
contradicting the assumption $p_1 < p_2$. [Values of $n$ for which $\lambda(n)$ properly divides
$n - 1$ are called Carmichael numbers. For example, here are some small Carmichael
numbers with up to six prime factors: $3\cdot11\cdot17$, $5\cdot13\cdot17$, $7\cdot11\cdot13\cdot41$, $5\cdot7\cdot17\cdot19\cdot73$,
$5\cdot7\cdot17\cdot73\cdot89\cdot107$. There are 8241 Carmichael numbers less than $10^{12}$, and there
are at least $\Omega(N^{2/7})$ Carmichael numbers less than $N$; see W. R. Alford, A. Granville,
and C. Pomerance, Annals of Math. (2) 139 (1994), 703-722.]
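
The five examples can be checked directly. For a product of distinct odd primes, $\lambda(n)$ is the least common multiple of the $p_i - 1$; the sketch below relies on that fact, and the helper names are hypothetical.

```python
from functools import reduce
from math import gcd, prod

def lcm(a, b):
    return a // gcd(a, b) * b

def lambda_squarefree(primes):
    """lambda(n) for n a product of distinct odd primes: lcm of the p - 1."""
    return reduce(lcm, (p - 1 for p in primes))

EXAMPLES = [(3, 11, 17), (5, 13, 17), (7, 11, 13, 41),
            (5, 7, 17, 19, 73), (5, 7, 17, 73, 89, 107)]

for ps in EXAMPLES:
    n = prod(ps)
    lam = lambda_squarefree(ps)
    # lambda(n) should divide n - 1 (and be smaller than n - 1).
    print(n, lam, (n - 1) % lam == 0)
```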


10. Let $k_p$ be the order of $x_p$ modulo $n$, and let $\lambda$ be the least common multiple of all
the $k_p$'s. Then $\lambda$ is a divisor of $n - 1$ but not of any $(n - 1)/p$, so $\lambda = n - 1$. Since
$x_p^{\varphi(n)} \bmod n = 1$, $\varphi(n)$ is a multiple of $k_p$ for all $p$, so $\varphi(n) \ge \lambda$. But $\varphi(n) < n - 1$
when $n$ is not prime. (Another way to carry out the proof is to construct an element $x$
of order $n - 1$ from the $x_p$'s, by the method of exercise 3.2.1.2-15.)


11, U V A P S T Output 
1984 1 0 992 0 = 
1981 1981 1 992 1 1981 
1983 4 495 993 0 1 993? = +2? 
1983 991 2 98109 1 991 
1981 4 495 2 0 1 JAS +49? 
1984 1981 1 99099 1 1981 
1984 1 1984 99101 0 1 99101? = +2° 


The factorization 199-991 is evident from the first or last outputs. The shortness of the 
cycle, and the appearance of the notorious number 1984, are probably just coincidences. 


12. The following algorithm makes use of an auxiliary $(m + 1) \times (m + 1)$ matrix of
integers $E_{jk}$, $0 \le j, k \le m$; a single-precision vector $(b_0, b_1, \ldots, b_m)$; and a multiple-precision
vector $(x_0, x_1, \ldots, x_m)$ with entries in the range $0 \le x_k < N$.

F1. [Initialize.] Set $b_i \leftarrow -1$ for $0 \le i \le m$; then set $j \leftarrow 0$.

F2. [Next solution.] Get the next output $(x, e_0, e_1, \ldots, e_m)$ from Algorithm E. (It
is convenient to regard Algorithms E and F as coroutines.) Set $k \leftarrow m$.

F3. [Search for odd.] If $k < 0$ go to step F5. Otherwise if $e_k$ is even, set $k \leftarrow k - 1$
and repeat this step.

F4. [Linear dependence?] If $b_k \ge 0$, then set $i \leftarrow b_k$, $x \leftarrow (x_ix) \bmod N$, $e_r \leftarrow
e_r + E_{ir}$ for $0 \le r \le m$; set $k \leftarrow k - 1$ and return to F3. Otherwise set $b_k \leftarrow j$,
$x_j \leftarrow x$, $E_{jr} \leftarrow e_r$ for $0 \le r \le m$; set $j \leftarrow j + 1$ and return to F2. (In the
latter case we have a new linearly independent solution, modulo 2, whose first
odd component is $e_k$. The values $E_{jr}$ are not guaranteed to remain single-precision,
but they tend to remain small when $k$ decreases from $m$ to 0 as
recommended by Morrison and Brillhart.)

F5. [Try to factor.] (Now $e_0, e_1, \ldots, e_m$ are even.) Set

$$y \leftarrow ((-1)^{e_0/2}p_1^{e_1/2}\ldots p_m^{e_m/2}) \bmod N.$$

If $x = y$ or if $x + y = N$, return to F2. Otherwise compute $\gcd(x - y, N)$,
which is a proper factor of $N$, and terminate the algorithm. ∎

This algorithm finds a factor whenever it is possible to deduce one from the given
outputs of Algorithm E. [Proof. Let the outputs of Algorithm E be $(X_i, E_{i0}, \ldots, E_{im})$
for $1 \le i \le t$, and suppose that we could find a factorization $N = N_1N_2$ when $x \equiv
X_1^{a_1}\ldots X_t^{a_t}$ and $y \equiv (-1)^{e_0/2}p_1^{e_1/2}\ldots p_m^{e_m/2}$ (modulo $N$), where $e_j = a_1E_{1j}+\cdots+a_tE_{tj}$
is even for all $j$. Then $x \equiv \pm y$ (modulo $N_1$) and $x \equiv \mp y$ (modulo $N_2$). It is not difficult
to see that this solution can be transformed into a pair $(x, y)$ that appears in step F5,
by a series of steps that systematically replace $(x, y)$ by $(xx', yy')$ where $x' \equiv \pm y'$
(modulo $N$).]
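
Steps F1 to F5 can be sketched in Python as follows. This is an illustration under stated assumptions, not the book's program: the outputs of Algorithm E are supplied as a ready-made list of pairs `(x, [e_0, ..., e_m])` satisfying $x^2 \equiv (-1)^{e_0}p_1^{e_1}\ldots p_m^{e_m}$ (modulo $N$), and the name `algorithm_f` and its argument layout are hypothetical.

```python
from math import gcd

def algorithm_f(N, primes, relations):
    """Combine relations until a proper factor of N appears; None if none does."""
    m = len(primes)
    b = [-1] * (m + 1)                 # F1: b[k] = row whose first odd entry is e_k
    xs, E = [], []                     # saved x_j and exponent rows E_j
    for x, e in relations:             # F2: next output of Algorithm E
        e = list(e)
        k = m
        saved = False
        while k >= 0:                  # F3: search downward for an odd exponent
            if e[k] % 2 == 0:
                k -= 1
            elif b[k] >= 0:            # F4: cancel the odd entry with row b[k]
                i = b[k]
                x = (xs[i] * x) % N
                e = [er + Eir for er, Eir in zip(e, E[i])]
                k -= 1
            else:                      # F4: new independent row
                b[k] = len(xs)
                xs.append(x)
                E.append(e)
                saved = True
                break
        if saved:
            continue
        y = 1                          # F5: all exponents are even now
        for p, er in zip(primes, e[1:]):
            y = y * pow(p, er // 2, N) % N
        if (e[0] // 2) % 2:
            y = (N - y) % N
        x %= N
        if x == y or (x + y) % N == 0:
            continue
        return gcd(x - y, N)
    return None

# 41^2 = 2^5 and 43^2 = 2^3 * 5^2 (modulo 1649).
print(algorithm_f(1649, [2, 5], [(41, [0, 5, 0]), (43, [0, 3, 2])]))   # 17
```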

13. There are 2% values of x having the same exponents me ...,€m), since we can 
choose the sign of x modulo q}* arbitrarily when N = qi}... 4. Exactly two of these 


21 values will fail to yield a factor. 


14. Since $P^2 \equiv kNQ^2$ (modulo $p$) for any prime divisor $p$ of $V$, we get $1 \equiv P^{2(p-1)/2} \equiv
(kNQ^2)^{(p-1)/2} \equiv (kN)^{(p-1)/2}$ (modulo $p$), if $P \not\equiv 0$.


15. $U_n = (a^n - b^n)/\sqrt D$, where $a = \frac12(P + \sqrt D)$, $b = \frac12(P - \sqrt D)$, $D = P^2 - 4Q$. Then
$2^{n-1}U_n = \sum_k\binom{n}{2k+1}P^{n-2k-1}D^k$; so $U_p \equiv D^{(p-1)/2}$ (modulo $p$) if $p$ is an odd prime.
Similarly, if $V_n = a^n + b^n = U_{n+1} - QU_{n-1}$, then $2^{n-1}V_n = \sum_k\binom{n}{2k}P^{n-2k}D^k$, and
$V_p \equiv P^p \equiv P$. Thus if $U_p \equiv -1$, we find that $U_{p+1} \bmod p = 0$. If $U_p \equiv 1$, we find that
$(QU_{p-1}) \bmod p = 0$; here if $Q$ is a multiple of $p$, $U_n \equiv P^{n-1}$ (modulo $p$) for $n > 0$, so
$U_n$ is never a multiple of $p$; if $Q$ is not a multiple of $p$, $U_{p-1} \bmod p = 0$. Therefore as in
Theorem L, $U_t \bmod N = 0$ if $N = p_1^{e_1}\ldots p_r^{e_r}$, $N \perp Q$, and $t = \mathrm{lcm}_{1\le j\le r}(p_j^{e_j-1}(p_j+\epsilon_j))$.
Under the assumptions of this exercise, the rank of apparition of $N$ is $N + 1$; hence $N$
is prime to $Q$ and $t$ is a multiple of $N + 1$. Also, the assumptions of this exercise imply
that each $p_j$ is odd and each $\epsilon_j$ is $\pm1$, so $t \le 2^{1-r}\prod p_j^{e_j-1}(p_j + \frac13p_j) = 2(\frac23)^rN$; hence
$r = 1$ and $t = p_1^{e_1} + \epsilon_1p_1^{e_1-1}$. Finally, therefore, $e_1 = 1$ and $\epsilon_1 = 1$.

Note: If this test for primality is to be any good, we must choose $P$ and $Q$ in
such a way that the test will probably work. Lehmer suggests taking $P = 1$ so that
$D = 1 - 4Q$, and choosing $Q$ so that $N \perp QD$. (If the latter condition fails, we know
already that $N$ is not prime, unless $|QD| \ge N$.) Furthermore, the derivation above
shows that we will want $\epsilon_1 = 1$, that is, $D^{(N-1)/2} \equiv -1$ (modulo $N$). This is another
condition that determines the choice of $Q$. Furthermore, if $D$ satisfies this condition,
and if $U_{N+1} \bmod N \ne 0$, we know that $N$ is not prime.

Example: If $P = 1$ and $Q = -1$, we have the Fibonacci sequence, with $D = 5$.
Since $5^{11} \equiv -1$ (modulo 23), we might attempt to prove that 23 is prime by using the
Fibonacci sequence:

$(F_n \bmod 23) = 0, 1, 1, 2, 3, 5, 8, 13, 21, 11, 9, 20, 6, 3, 9, 12, 21, 10, 8, 18, 3, 21, 1, 22, 0, \ldots,$

so 24 is the rank of apparition of 23 and the test works. However, the Fibonacci
sequence cannot be used in this way to prove the primality of 13 or 17, since $F_7 \bmod
13 = 0$ and $F_9 \bmod 17 = 0$. When $p \equiv \pm1$ (modulo 10), we have $5^{(p-1)/2} \bmod p = 1$, so
$F_{p-1}$ (not $F_{p+1}$) is divisible by $p$.
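
The ranks of apparition quoted in the example can be reproduced with a few lines of Python. The helper name `fib_rank` is hypothetical.

```python
def fib_rank(n):
    """Smallest k > 0 such that the Fibonacci number F_k is divisible by n."""
    a, b = 0, 1          # (F_0, F_1) modulo n
    k = 0
    while True:
        a, b = b, (a + b) % n
        k += 1
        if a == 0:
            return k

print(fib_rank(23))      # 24 = 23 + 1, so the test succeeds for 23
print(fib_rank(13))      # 7, too small to prove that 13 is prime
print(fib_rank(17))      # 9, too small to prove that 17 is prime
```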
17. Let $f(q) = 2\lg q - 1$. When $q = 2$ or 3, the tree has at most $f(q)$ nodes. When
$q > 3$ is prime, let $q = 1 + q_1\ldots q_t$ where $t \ge 2$ and $q_1, \ldots, q_t$ are prime. The size of
the tree is $\le 1 + \sum f(q_k) = 2 + f(q-1) - t < f(q)$. [SICOMP 4 (1975), 214-220.]


18. 2(G(a) — F(a)) is the number of n < x whose second-largest prime factor is < x“ 
and whose largest prime factor is > x“. Hence 


2G" (t) dt = (m(a"t**) — m(e")) - a!*(G(t/(1 — t)) — F(t/(1- 8). 


The probability that py_1 < Vpr is Ío F(t/2(1 — t))t™* dt. [Curiously, it can be shown 
that this also equals fy F(t/(1 — t)) dt, the average value of log p:/log x, and it also 
equals the Dickman-Golomb constant .62433 of exercises 1.3.3-23 and 3.1-13. The 
derivative G” (0) can be shown to equal 


fo FĖ(t/0 — t))t7? dt a a 


The third-largest prime factor has H (a) = fy (H(t/(1 — t)) — G(t/(1 — t)))t™ dt and 
$H'(0) = \infty$. See P. Billingsley, Period. Math. Hungar. 2 (1972), 283-289; J. Galambos,
Acta Arith. 31 (1976), 213-218; D. E. Knuth and L. Trabb Pardo, Theoretical Comp. 
Sci. 3 (1976), 321-348; J. L. Hafner and K. S. McCurley, J. Algorithms 10 (1989), 
531-556.] 


19. $M = 2^D - 1$ is a multiple of all $p$ for which the order of 2 modulo $p$ divides $D$. To
extend this idea, let $a_1 = 2$ and $a_{j+1} = a_j^{q_j} \bmod N$, where $q_j = p_j^{e_j}$, $p_j$ is the $j$th prime,
and $e_j = \lfloor\log1000/\log p_j\rfloor$; let $A = a_{169}$. Now compute $b_q = \gcd(A^q - 1, N)$ for all
primes $q$ between $10^3$ and $10^5$. One way to do this is to start with $A^{1009} \bmod N$ and
then to multiply alternately by $A^4 \bmod N$ and $A^2 \bmod N$. (A similar method was used
in the 1920s by D. N. Lehmer, but he didn't publish it.) As with Algorithm B we can
avoid most of the gcds by batching; for example, since $b_{30r-k} = \gcd(A^{30r} - A^k, N)$, we
might try batches of 8, computing $c_r = (A^{30r} - A^{29})(A^{30r} - A^{23})\ldots(A^{30r} - A) \bmod N$,
then $\gcd(c_r, N)$ for $33 < r \le 3334$.


20. See H. C. Williams, Math. Comp. 39 (1982), 225-234. 


21. Some interesting theory relevant to this conjecture has been introduced by Eric 
Bach, Information and Computation 90 (1991), 139-155. 


22. Algorithm P fails only when the random number $x$ does not reveal the fact that
$n$ is nonprime. Say $x$ is bad if $x^q \bmod n = 1$ or if one of the numbers $x^{2^jq}$ is $\equiv -1$
(modulo $n$) for $0 \le j < k$. Since 1 is bad, we have $p_n = [n \text{ nonprime}]\,(b_n - 1)/(n-2) <
[n \text{ nonprime}]\,b_n/(n - 1)$, where $b_n$ is the number of bad $x$ such that $1 \le x < n$.

Every bad $x$ satisfies $x^{n-1} \equiv 1$ (modulo $n$). When $p$ is prime, the number of
solutions to the congruence $x^q \equiv 1$ (modulo $p^e$) for $1 \le x \le p^e$ is the same as the
number of solutions of $qy \equiv 0$ (modulo $p^{e-1}(p-1)$) for $0 \le y < p^{e-1}(p-1)$, namely
$\gcd(q, p^{e-1}(p-1))$, since we may replace $x$ by $a^y$ where $a$ is a primitive root.

Let $n = n_1^{e_1}\ldots n_r^{e_r}$, where the $n_i$ are distinct primes. According to the Chinese
remainder theorem, the number of solutions to the congruence $x^{n-1} \equiv 1$ (modulo $n$) is
$\prod_{i=1}^r\gcd(n-1, n_i^{e_i-1}(n_i-1))$, and this is at most $\prod_{i=1}^r(n_i - 1)$ since $n_i$ is relatively
prime to $n - 1$. If some $e_i > 1$, we have $n_i - 1 \le \frac29n_i^{e_i}$, hence the number of solutions
is at most $\frac29n$; in this case $b_n \le \frac29n \le \frac14(n-1)$, since $n \ge 9$.

Therefore we may assume that $n$ is the product $n_1\ldots n_r$ of distinct primes. Let
$n_i = 1 + 2^{k_i}q_i$, where $k_1 \le \cdots \le k_r$. Then $\gcd(n-1, n_i-1) = 2^{k_i'}q_i'$, where $k_i' =
\min(k, k_i)$ and $q_i' = \gcd(q, q_i)$. Modulo $n_i$, the number of $x$ such that $x^q \equiv 1$ is $q_i'$; and
the number of $x$ such that $x^{2^jq} \equiv -1$ is $2^jq_i'$ for $0 \le j < k_i$, otherwise 0. Since $k \ge k_1$,
we have $b_n = q_1'\ldots q_r'\,(1 + \sum_{0\le j<k_1}2^{jr})$.

To complete the proof, it suffices to show that $b_n \le \frac14q_1\ldots q_r2^{k_1+\cdots+k_r} = \frac14\varphi(n)$,
since $\varphi(n) < n - 1$. We have

$$\Bigl(1 + \sum_{0\le j<k_1}2^{jr}\Bigr)\Big/2^{k_1+\cdots+k_r} \le \Bigl(1 + \sum_{0\le j<k_1}2^{jr}\Bigr)\Big/2^{k_1r}
= 1/(2^r-1) + (2^r-2)/(2^{k_1r}(2^r-1)) \le 1/2^{r-1},$$

so the result follows unless $r = 2$ and $k_1 = k_2$. If $r = 2$, exercise 9 shows that $n - 1$ is
not a multiple of both $n_1 - 1$ and $n_2 - 1$. Thus if $k_1 = k_2$ we cannot have both $q_1' = q_1$
and $q_2' = q_2$; it follows that $q_1'q_2' \le \frac13q_1q_2$ and $b_n \le \frac16\varphi(n)$ in this case.

[Reference: J. Number Theory 12 (1980), 128-138.] This proof shows that $p_n$
is near $\frac14$ in only two cases, when $n$ is $(1 + 2q_1)(1 + 4q_1)$ or a Carmichael number of
the special form $(1 + 2q_1)(1 + 2q_2)(1 + 2q_3)$. For example, when $n = 49939\cdot99877$ we
have $b_n = \frac14(49938\cdot99876)$ and $p_n \approx .24999$; when $n = 1667\cdot2143\cdot4523$, we have
$b_n = \frac14(1666\cdot2142\cdot4522)$, $p_n \approx .24968$. See the next answer for further remarks.
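
The bound $b_n \le \frac14(n-1)$ can be observed by brute force for small odd composite $n$. The sketch below counts bad $x$ exactly as defined in this answer, with $n - 1 = 2^kq$ and $q$ odd; the helper names are hypothetical.

```python
from math import isqrt

def bad_count(n):
    """Number of x in [1, n) with x^q = 1 or x^(2^j q) = -1 (mod n), 0 <= j < k."""
    q, k = n - 1, 0
    while q % 2 == 0:            # write n - 1 = 2^k * q with q odd
        q //= 2
        k += 1
    count = 0
    for x in range(1, n):
        y = pow(x, q, n)
        bad = (y == 1)
        for _ in range(k):       # y runs through x^(2^j q) for j = 0..k-1
            if y == n - 1:
                bad = True
            y = y * y % n
        count += bad
    return count

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, isqrt(n) + 1))

worst = max((bad_count(n) / (n - 1), n)
            for n in range(9, 400, 2) if not is_prime(n))
print(worst)                     # the ratio never exceeds 1/4
```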


23. (a) The proofs are simple except perhaps for the reciprocity law. Let $p = p_1\ldots p_s$
and $q = q_1\ldots q_r$, where the $p_i$ and $q_j$ are prime. Then

$$\Bigl(\frac{p}{q}\Bigr) = \prod_{i,j}\Bigl(\frac{p_i}{q_j}\Bigr) = \prod_{i,j}(-1)^{(p_i-1)(q_j-1)/4}\Bigl(\frac{q_j}{p_i}\Bigr)
= (-1)^{\sum_{i,j}(p_i-1)(q_j-1)/4}\Bigl(\frac{q}{p}\Bigr),$$

so we need only verify that $\sum_{i,j}(p_i-1)(q_j-1)/4 \equiv (p-1)(q-1)/4$ (modulo 2). But
$\sum_{i,j}(p_i-1)(q_j-1)/4 = \bigl(\sum_i(p_i-1)/2\bigr)\bigl(\sum_j(q_j-1)/2\bigr)$ is odd if and only if an odd
number of the $p_i$ and an odd number of the $q_j$ are $\equiv 3$ (modulo 4), and this holds if
and only if $(p-1)(q-1)/4$ is odd. [C. G. J. Jacobi, Bericht Königl. Preuß. Akad. Wiss.
Berlin 2 (1837), 127-136; V. A. Lebesgue, J. Math. Pures Appl. 12 (1847), 497-520,
discussed the efficiency.]

(b) As in exercise 22, we may assume that n = n1...n, where the nj = 1+ QR g, 


are distinct primes, and kı <--- < kr; we let ged(n — 1, nj — 1) = aki q; and we call x 
bad if it falsely makes n look prime. Let Hn = []j_, 4 gmin(ki-k—1) be the number of 
solutions of #(~)/? = 1. The number of bad z with (2) = 1 is In, times an extra 
factor of $ if ki < k. (This factor 5 is needed to ensure that (=) = —1 for an even 
number of the n; with k; < k.) The number of bad x with (2) = -1 is In if ki =k, 
otherwise 0. [If r™7®/2? = —1 (modulo n;), we have (Æ) = —1 if ki = k, (2) = +1 
if ki > k, and a contradiction if k; < k. If kı = k, there are an odd number of ki 
equal to k.] 

Notes: The probability of a bad guess is > + only if n is a Carmichael number with 
kr < k; for example, n = 7-13-19 = 1729, a number made famous by Ramanujan in 
another context. Louis Monier has extended the analyses above to obtain the following 
closed formulas for the number of bad x in general: 


or = 1\ ey ; = n—1 
bws (Tp L Bf = 6, d ,mi—1). 
(1+ war) Ls II eca( a ) 


Here b, is the number of bad zx in this exercise, and dn is either 2 (if kı = k), or 4 (if 
ki < k and e; is odd for some i), or 1 (otherwise). 

(c) If $x^q \bmod n = 1$, then $1 = \bigl(\frac{x^q}{n}\bigr) = \bigl(\frac{x}{n}\bigr)^q = \bigl(\frac{x}{n}\bigr)$. If $x^{2^jq} \equiv -1$ (modulo $n$), then
the order of $x$ modulo $n_i$ must be an odd multiple of $2^{j+1}$ for all prime divisors $n_i$
of $n$. Let $n = n_1^{e_1}\ldots n_r^{e_r}$ and $n_i = 1 + 2^{j+1}q_i''$; then $\bigl(\frac{x}{n_i}\bigr) = (-1)^{q_i''}$, so $\bigl(\frac{x}{n}\bigr) = +1$ or $-1$
according as $\sum e_iq_i''$ is even or odd. Since $n \equiv (1 + 2^{j+1}\sum e_iq_i'')$ (modulo $2^{j+2}$), the
sum $\sum e_iq_i''$ is odd if and only if $j + 1 = k$. [Theoretical Comp. Sci. 12 (1980), 97-108.]


24. Let Mı be a matrix having one row for each nonprime odd number n in the range 
1 < n < N and having N—1 columns numbered from 2 to N; the entry in row n column 
x is 1 if n fails the x test of Algorithm P, otherwise it is zero. When N = qn +r and 
0 <r <n, we know that row n contains at most —1 + q(bn + 1) + min(b, + 1,r) < 
q(¢(n — 1) +1) + min(b, +1, r) < ġqn + min(Zn,r) = $N + min(Zn — Fr, Zr) < 
iN + in < iN entries equal to 0, so at least half of the entries in the matrix are 1. 
Thus, some column 2; of Mı has at least half of its entries equal to 1. Removing 
column zı and all rows in which this column contains 1 leaves a matrix M2 having 
similar properties; a repetition of this construction produces matrix M, with N — r 
columns and fewer than N/2" rows, and with at least (N — 1) entries per row equal 
to 1. [See FOCS 19 (1978), 78.] 

[A similar proof implies the existence of a single infinite sequence zı < z2 < --- 

such that the number n > 1 is prime if and only if it passes the x test of Algorithm P 
for © = £1, ..., © = Xm, Where m = 3(IgnJ([Ign] — 1). Does there exist a sequence 
zı < £2 < --- having this property but with m = O(log n)?] 
25. This theorem was first proved rigorously by von Mangoldt [Crelle 114 (1895), 255-305],
who showed in fact that the $O(1)$ term is $C + \int_x^\infty dt/((t^2-1)\,t\ln t)$, minus $1/2k$ if $x$
is the $k$th power of a prime. The constant $C$ is $\mathrm{li}\,2 - \ln2 = \gamma + \ln\ln2 + \sum_{n\ge2}(\ln2)^n/nn! =$
0.35201 65995 57547 47542 73567 67736 43656 84471+.

[For a summary of developments during the 100 years following von Mangoldt’s 
paper, see A. A. Karatsuba, Complex Analysis in Number Theory (CRC Press, 1995). 
See also Eric Bach and Jeffrey Shallit, Algorithmic Number Theory 1 (MIT Press, 
1996), Chapter 8, for an excellent introduction to the connection between Riemann’s 
hypothesis and concrete problems about integers.] 


n>2 


26. If $N$ is not prime, it has a prime factor $q \le \sqrt N$. By hypothesis, every prime
divisor $p$ of $f$ has an integer $x_p$ such that the order of $x_p$ modulo $q$ is a divisor of $N - 1$
but not of $(N-1)/p$. Therefore if $p^k$ divides $f$, the order of $x_p$ modulo $q$ is a multiple
of $p^k$. Exercise 3.2.1.2-15 now tells us that there is an element $x$ of order $f$ modulo $q$.
But this is impossible, since it implies that $q^2 \ge (f+1)^2 \ge (f+1)r \ge N$, and equality
cannot hold. [Proc. Camb. Phil. Soc. 18 (1914), 29-30.]

27. If $k$ is not divisible by 3 and if $k \le 2^n + 1$, the number $k\cdot2^n + 1$ is prime if and
only if $3^{2^{n-1}k} \equiv -1$ (modulo $k\cdot2^n + 1$). For if this condition holds, $k\cdot2^n + 1$ is prime
by exercise 26; and if $k\cdot2^n + 1$ is prime, the number 3 is a quadratic nonresidue mod
$k\cdot2^n + 1$ by the law of quadratic reciprocity, since $(k\cdot2^n + 1) \bmod 12 = 5$. [This test was
stated without proof by Proth in Comptes Rendus Acad. Sci. 87 (Paris, 1878), 926.]

To implement Proth’s test with the necessary efficiency, we need to be able to
compute $x^2 \bmod (k\cdot2^n + 1)$ with about the same speed as we can compute the quantity
$x^2 \bmod (2^n - 1)$. Let $x^2 = A\cdot2^n + B$; then $x^2 \equiv B - \lfloor A/k\rfloor + 2^n(A \bmod k)$, so the
remainder is easily obtained when $k$ is small. (See also exercise 4.3.2-14.)
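
The reduction just described can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `square_mod_proth` and `proth_is_prime` are hypothetical, the requirement $n \ge 2$ is added here so that the modulus is congruent to 1 modulo 4, and Python's built-in big integers stand in for the multiple-precision arithmetic the text has in mind.

```python
def square_mod_proth(x, k, n):
    """Return x*x mod (k*2**n + 1) via the splitting x*x = A*2**n + B."""
    modulus = (k << n) + 1
    A, B = divmod(x * x, 1 << n)
    # A*2**n = (A // k)*(k*2**n) + (A % k)*2**n, and k*2**n is congruent to -1.
    z = B - A // k + ((A % k) << n)
    # z can be negative or slightly too large; one cheap correction fixes it.
    return z % modulus


def proth_is_prime(k, n):
    """Proth's test for k*2**n + 1, assuming n >= 2, 3 does not divide k, k <= 2**n + 1."""
    if n < 2 or k % 3 == 0 or k > (1 << n) + 1:
        raise ValueError("hypotheses of the test are not satisfied")
    modulus = (k << n) + 1
    y = pow(3, k, modulus)       # 3**k
    for _ in range(n - 1):       # n-1 squarings give 3**(k * 2**(n-1))
        y = square_mod_proth(y, k, n)
    return y == modulus - 1
```

For example, `proth_is_prime(5, 3)` examines $5\cdot2^3 + 1 = 41$, and `proth_is_prime(1, 3)` examines $2^3 + 1 = 9$.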

[To test numbers of the form $3\cdot2^n + 1$ for primality, the job is only slightly
more difficult; we first try random single-precision numbers until finding one that is a
quadratic nonresidue mod $3\cdot2^n + 1$ by the law of quadratic reciprocity, then use this
number in place of “3” in the test above. If $n \bmod 4 \ne 0$, the number 5 can be used.
It turns out that $3\cdot2^n + 1$ is prime when n = 1, 2, 5, 6, 8, 12, 18, 30, 36, 41, 66, 189,
201, 209, 276, 353, 408, 438, 534, 2208, 2816, 3168, 3189, 3912, 20909, 34350, 42294,
42665, 44685, 48150, 55182, 59973, 80190, 157169, 213321, and no other n < 300000;
and $5\cdot2^n + 1$ is prime when n = 1, 3, 7, 13, 15, 25, 39, 55, 75, 85, 127, 1947, 3313,
4687, 5947, 13165, 23473, 26607, 125413, 209787, 240937, and no other n < 300000.
See R. M. Robinson, Proc. Amer. Math. Soc. 9 (1958), 673-681; G. V. Cormack and
H. C. Williams, Math. Comp. 35 (1980), 1419-1421; H. Dubner and W. Keller, Math.
Comp. 64 (1995), 397-405; J. S. Young, Math. Comp. 67 (1998), 1735-1738.]


28. $f(p, p^2d) = 2/(p+1) + f(p,d)/p$, since $1/(p+1)$ is the probability that $A$ is
a multiple of $p$. $f(p, pd) = 1/(p+1)$ when $d \bmod p \ne 0$. $f(2, 4k+3) = \frac13$ since
$A^2 - (4k+3)B^2$ cannot be a multiple of 4; $f(2, 8k+5) = \frac23$ since $A^2 - (8k+5)B^2$ cannot
be a multiple of 8; $f(2, 8k+1) = \frac13 + \frac13 + \frac13 + \frac16 + \frac1{12} + \cdots = \frac43$. $f(p,d) = (2p/(p^2-1),\ 0)$
if $d^{(p-1)/2} \bmod p = (1,\ p-1)$, respectively, for odd $p$.


29. The number of solutions to the inequality $x_1 + \cdots + x_m \le r$ in nonnegative integers
$x_j$ is $\binom{m+r}{r} \ge m^r/r!$, and each of these corresponds to a unique integer $p_1^{x_1}\ldots p_m^{x_m} \le n$.
[For sharper estimates, in the special case that $p_j$ is the $j$th prime for all $j$, see N. G.
de Bruijn, Indag. Math. 28 (1966), 240-247; H. Halberstam, Proc. London Math. Soc.
(3) 21 (1970), 102-107.]


30. If $p_1^{e_1}\ldots p_m^{e_m} \equiv x_i^2$ (modulo $q_i$), we can find $y_i$ such that $p_1^{e_1}\ldots p_m^{e_m} \equiv (\pm y_i)^2$
(modulo $q_i^{f_i}$), hence by the Chinese remainder theorem we obtain $2^d$ values of $X$ such
that $X^2 \equiv p_1^{e_1}\ldots p_m^{e_m}$ (modulo $N$). Such $(e_1,\ldots,e_m)$ correspond to at most $\binom{r}{r/2}$ pairs
$(e_1',\ldots,e_m';\ e_1'',\ldots,e_m'')$ having the hinted properties. Now for each of the $2^d$ binary
numbers $a = (a_1\ldots a_d)_2$, let $n_a$ be the number of exponents $(e_1',\ldots,e_m')$ such that
$(p_1^{e_1'}\ldots p_m^{e_m'})^{(q_i-1)/2} \equiv (-1)^{a_i}$ (modulo $q_i$); we have proved that the required number of
integers $X$ is $\ge 2^d(\sum_a n_a^2)/\binom{r}{r/2}$. Since $\sum_a n_a$ is the number of ways to choose at most
$r/2$ objects from a set of $m$ objects with repetitions permitted, namely $\binom{m+r/2}{r/2}$, we
have $\sum_a n_a^2 \ge \binom{m+r/2}{r/2}^2/2^d \ge m^r/(2^d(r/2)!^2)$. [See J. Algorithms 3 (1982), 101-127,
where Schnorr presents many further refinements of Theorem D.]


31. Set $n = M$, $pM = 4m$, and $\epsilon M = 2m$ to show that $\Pr(X \le 2m) \le e^{-m/2}$.


32. Let $M = \lfloor\sqrt[3]{N}\rfloor$, and let the places $x_i$ of each message be restricted to the range
$0 \le x < M^3 - M^2$. If $x \ge M$, encode it as $x^3 \bmod N$ as before, but if $x < M$
change the encoding to $(x + yM)^3 \bmod N$, where $y$ is a random number in the range
$M^2 - M \le y < M^2$. To decode, first take the cube root; and if the result is $M^3 - M^2$
or more, take the remainder mod $M$.
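
The padding rule above can be tried on a toy modulus. The following Python sketch is an illustration only: the primes 1013 and 1019 (both congruent to 2 modulo 3, so that cubing is invertible), the names `encode` and `decode`, and the use of a decryption exponent in place of a cube-root box are all assumptions made for the example.

```python
import random


def integer_cube_root(n):
    """Largest integer c with c**3 <= n, by bisection."""
    lo, hi = 0, 1
    while hi ** 3 <= n:
        hi *= 2
    while hi - lo > 1:               # invariant: lo**3 <= n < hi**3
        mid = (lo + hi) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid
    return lo


P, Q = 1013, 1019                    # toy primes, assumed for illustration
N = P * Q
M = integer_cube_root(N)
D = pow(3, -1, (P - 1) * (Q - 1))    # cube roots mod N via exponent D


def encode(x, rng):
    """Encode 0 <= x < M**3 - M**2; small x is padded with a random y."""
    if not 0 <= x < M ** 3 - M ** 2:
        raise ValueError("message out of range")
    if x >= M:
        return pow(x, 3, N)
    y = rng.randrange(M * M - M, M * M)
    return pow(x + y * M, 3, N)


def decode(c):
    """Take the cube root; padded values are at least M**3 - M**2."""
    z = pow(c, D, N)
    return z % M if z >= M ** 3 - M ** 2 else z
```

A round trip such as `decode(encode(42, random.Random(0)))` should return 42, even though 42 is smaller than $M$ and would otherwise be cubed without any modular reduction.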


34. Let $P$ be the probability that $x^m \bmod p = 1$ and let $Q$ be the probability that
$x^m \bmod q = 1$. The probability that $\gcd(x^m - 1, N) = p$ or $q$ is $P(1-Q) + Q(1-P) =
P + Q - 2PQ$. If $P \le \frac12$ or $Q \le \frac12$, this probability is $\ge 2(10^{-6} - 10^{-12})$, so we have a
good chance of finding a factor after about $10^6 \log m$ arithmetic operations modulo $N$.
On the other hand if $P > \frac12$ and $Q > \frac12$ then $P \approx Q \approx 1$, since we have the general
formula $P = \gcd(m, p-1)/p$; thus $m$ is a multiple of $\mathrm{lcm}(p-1, q-1)$ in this case. Let
$m = 2^k r$ where $r$ is odd, and form the sequence $x^r \bmod N$, $x^{2r} \bmod N$, ..., $x^{2^k r} \bmod N$;
we find as in Algorithm P that the first appearance of 1 is preceded by a value $y$ other
than $N - 1$ with probability $\ge \frac12$, hence $\gcd(y - 1, N) = p$ or $q$.
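
The second case, in which $m$ is a multiple of $\mathrm{lcm}(p-1, q-1)$, can be sketched as follows. This is a minimal illustration, not the book's program: the function name `split_with_exponent` is hypothetical, and small consecutive bases replace the random choices of $x$.

```python
from math import gcd


def split_with_exponent(N, m, bases=range(2, 50)):
    """Try to find a proper factor of N, given m with x**m = 1 (mod N) for x prime to N."""
    r, k = m, 0
    while r % 2 == 0:                # write m = 2**k * r with r odd
        r //= 2
        k += 1
    for x in bases:
        g = gcd(x, N)
        if 1 < g < N:
            return g                 # lucky: x shares a factor with N
        y = pow(x, r, N)
        if y == 1:
            continue                 # 1 appears at once; no predecessor to use
        for _ in range(k):
            z = y * y % N
            if z == 1:
                if y != N - 1:       # y is a square root of 1 other than +1 or -1
                    return gcd(y - 1, N)
                break
            y = z
    return None
```

With the toy values $N = 1013\cdot1019$ and $m = \mathrm{lcm}(1012, 1018) = 515108$, the call `split_with_exponent(1032247, 515108)` returns one of the two prime factors.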

35. Let $f = (p^{q-1} - q^{p-1}) \bmod N$. Since $p \bmod 4 = q \bmod 4 = 3$, we have $(\frac{-1}{p}) =
(\frac{-1}{q}) = (\frac{f}{p}) = -(\frac{f}{q}) = -1$, and we also have $(\frac{2}{p}) = -(\frac{2}{q}) = -1$. Given a message $x$ in
the range $0 \le x \le \frac18(N-5)$, let $\bar x = 4x + 2$ or $8x + 4$, whichever satisfies $(\frac{\bar x}{N}) \ge 0$; then
transmit the message $\bar x^2 \bmod N$.

To decode this message, we first use a SQRT box to find the unique number $y$ such
that $y^2 \equiv \bar x^2$ (modulo $N$) and $(\frac{y}{N}) \ge 0$ and $y$ is even. Then $y = \bar x$, since the other square
roots of $\bar x^2$ are $N - \bar x$ and $(\pm f\bar x) \bmod N$; the first of these is odd, and the other two
either have negative Jacobi symbols or are simply $\bar x$ and $N - \bar x$. The decoding is now
completed by setting $x \leftarrow \lfloor y/4\rfloor$ if $y \bmod 4 = 2$, otherwise $x \leftarrow \lfloor y/8\rfloor$.

Anybody who can decode such encodings can also find the factors of $N$, because
the decoding of a false message $\bar x^2 \bmod N$ when $(\frac{\bar x}{N}) = -1$ reveals $(\pm f) \bmod N$, and
$((\pm f) \bmod N) - 1$ has a nontrivial gcd with $N$. [Reference: IEEE Transactions IT-26
(1980), 726-729.]


36. The $m$th prime equals $m\ln m + m\ln\ln m - m + m\ln\ln m/\ln m - 2m/\ln m +
O(m(\log\log m)^2(\log m)^{-2})$, by (4), although for this problem we need only the weaker
estimate $p_m = m\ln m + O(m\log\log m)$. (We will assume that $p_m$ is the $m$th prime,
since this corresponds to the assumption that $V$ is uniformly distributed.) If we choose
$\ln m = \frac12 c\sqrt{\ln N\ln\ln N}$, where $c = O(1)$, we find that $r = c^{-1}\sqrt{\ln N/\ln\ln N} -
c^{-2} - c^{-2}(\ln\ln\ln N/\ln\ln N) - 2c^{-2}(\ln\frac12 c)/\ln\ln N + O(\sqrt{\ln\ln N/\ln N})$. The estimated
running time (22) now simplifies somewhat surprisingly to $\exp(f(c, N)\sqrt{\ln N\ln\ln N} +
O(\log\log N))$, where we have $f(c, N) = c + (1 - (1 + \ln 2)/\ln\ln N)c^{-1}$. The value of $c$
that minimizes $f(c, N)$ is $\sqrt{1 - (1 + \ln 2)/\ln\ln N}$, so we obtain the estimate

$$\exp\bigl(2\sqrt{\ln N\ln\ln N}\,\sqrt{1 - (1 + \ln 2)/\ln\ln N} + O(\log\log N)\bigr).$$

When $N = 10^{50}$ this gives $\epsilon(N) \approx .33$, which is still much larger than the observed
behavior.

Note: The partial quotients of $\sqrt D$ seem to behave according to the distribution
obtained for random real numbers in Section 4.5.3. For example, the first million partial
quotients of the square root of the number 101° + 314159 include exactly (415236, 
169719, 93180, 58606) cases where $A_n$ is respectively (1,2,3,4). Moreover, we have
$V_{n+1} = |p_n^2 - Dq_n^2| = 2\sqrt D\,q_n|p_n - \sqrt D\,q_n| + O(q_n^{-2})$ by exercise 4.5.3-12(c) and
Eq. 4.5.3-(12). Therefore we can expect $V_n/2\sqrt D$ to behave essentially like the quantity
$\theta_n(x) = q_n|p_n - xq_n|$, where $x$ is a random real number. The random variable $\theta_n$ is
known to have the approximate density $\min(1, \theta^{-1} - 1)/\ln 2$ for $0 \le \theta \le 1$ [see Bosma,
Jager, and Wiedijk, Indag. Math. 45 (1983), 281-299], which is uniform when $\theta \le 1/2$.
So something besides the size of Vn must account for the unreasonable effectiveness of 
Algorithm E. 


37. Apply exercise 4.5.3-12 to the number $\sqrt D + R$, to see that the periodic part
begins immediately, and run the period backwards to verify the palindromic property.
[It follows that the second half of the period gives the same $V$’s as the first, and
Algorithm E could be shut down earlier by terminating it when $U = U'$ or $V = V'$ in
step E5. However, the period is generally so long, we never even get close to halfway
through it, so there is no point in making the algorithm more complicated.]

38. Let r = (10°° —1)/9. Then Po = 1079+ 9; Pi = r+3-10°°; Py = 2r+3-10°" +7; 
P3 = 3r +2- 10°; Py = 4r +2 -10^ —3; Ps = 5r + 3 - 10°° +4; Ps = 6r + 2- 108 +3; 
P; = Tr +2- 10” (very pretty); Ps = 8r + 108 — 7; Py = 9r — 8000. 

39. Notice that it’s easy to prove the primality of q when q — 1 has just 2 and p 
as prime factors. The only successors of 2 are Fermat primes, and the existence or 
nonexistence of a sixth Fermat prime is one of the most famous unsolved problems of 
number theory. Thus we probably will never know how to determine whether or not 
an arbitrary integer has any successors. In some cases, however, this is possible; for 
example, John Selfridge proved in 1962 that 78557 and 271129 have none [see AMM 70
(1963), 101-102], after W. Sierpiński had proved the existence of infinitely many odd
numbers without a successor [Elemente der Math. 15 (1960), 73-74]. Perhaps 78557 is 
the smallest of these, although 69 other contenders for that honor still existed in 1983, 
according to G. Jaeschke and W. Keller [Math. Comp. 40 (1983), 381-384, 661-673; 
45 (1985), 637]. 

For information on the more traditional “Cunningham” form of prime chain, in
which the transitions are $p \to 2p+1$, see Günter Löh, Math. Comp. 53 (1989), 751-759.
In particular, Löh found that $554688278430\cdot2^k - 1$ is prime for $0 \le k < 12$.


40. [Inf. Proc. Letters 8 (1979), 28-31.] Notice that $x \bmod y = x - y\lfloor x/y\rfloor$ can be
computed easily on such a machine, and we can get simple constants like $0 = x - x$,
$1 = \lfloor x/x\rfloor$, $2 = 1 + 1$; we can test $x > 0$ by testing whether $x = 1$ or $\lfloor x/(x-1)\rfloor \ne 0$.

(a) First compute $l = \lfloor\lg n\rfloor$ in $O(\log n)$ steps, by repeatedly dividing by 2; at the
same time compute $k = 2^l$ and $A = 2^{2k}$ in $O(\log n)$ steps by repeatedly setting $k \leftarrow
2k$, $A \leftarrow A^2$. For the main computation, suppose we know that $t = A^m$, $u = (A+1)^m$,
and $v = m!$; then we can increase the value of $m$ by 1 by setting $m \leftarrow m + 1$, $t \leftarrow At$,
$u \leftarrow (A+1)u$, $v \leftarrow vm$; and we can double the value of $m$ by setting $m \leftarrow 2m$, $u \leftarrow u^2$,
$v \leftarrow (\lfloor u/t\rfloor \bmod A)v^2$, $t \leftarrow t^2$, provided that $A$ is sufficiently large. (Consider the
number $u$ in radix-$A$ notation; $A$ must be greater than $\binom{2m}{m}$.) Now if $n = (a_l\ldots a_0)_2$,
let $n_j = (a_l\ldots a_j)_2$; if $m = n_j$ and $k = 2^j$ and $j > 0$ we can decrease $j$ by 1 by
setting $k \leftarrow \lfloor k/2\rfloor$, $m \leftarrow 2m + (\lfloor n/k\rfloor \bmod 2)$. Hence we can compute $n_j!$ for $j = l$,
$l-1$, ..., 0 in $O(\log n)$ steps. [Another solution, due to Julia Robinson, is to compute
$n! = \lfloor B^n/\binom{B}{n}\rfloor$ when $B > (2n)^{n+1}$; see AMM 80 (1973), 250-251, 266.]
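
Both constructions in part (a) can be checked with a short Python sketch. The names `factorial_by_doubling` and `factorial_robinson` are hypothetical, and the choice $A = 2^{2k}$ with $k = 2^{\lfloor\lg n\rfloor}$ is an assumption made here to guarantee that $A$ exceeds every central binomial coefficient that arises.

```python
from math import comb


def factorial_by_doubling(n):
    """Compute n! with O(log n) big-number operations, as in part (a)."""
    if n == 0:
        return 1
    l = n.bit_length() - 1
    k = 1 << l
    A = 1 << (2 * k)                 # A > 2**n > C(2m, m) whenever 2m <= n
    m, t, u, v = 1, A, A + 1, 1      # invariants: t = A**m, u = (A+1)**m, v = m!
    for j in range(l - 1, -1, -1):
        u = u * u                    # doubling step: u = (A+1)**(2m)
        v = ((u // t) % A) * v * v   # radix-A digit m of u is C(2m, m)
        t = t * t
        m = 2 * m
        if (n >> j) & 1:             # append the next binary digit of n
            m += 1
            t *= A
            u *= A + 1
            v *= m
    return v


def factorial_robinson(n):
    """Julia Robinson's formula n! = floor(B**n / C(B, n)) for B > (2n)**(n+1)."""
    B = (2 * n) ** (n + 1) + 1
    return B ** n // comb(B, n)
```

For instance, `factorial_by_doubling(10)` passes through $m = 1, 2, 5, 10$ and ends with $\binom{10}{5}\cdot(5!)^2 = 3628800$.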


(b) First compute A = 22+? as in (a), then find the least k > 0 such that 
2**"l modn = 0. If ged(n,2*!) Æ 1, let f(n) be this value; note that this gcd can 
be computed in O(log n) steps by Euclid’s algorithm. Otherwise we will find the least 
integer m such that (enya) mod n = 0, and let f(n) = ged(m,n). (Note that in this 
case 2* < m < 2**1, hence [m/2] < 2" and [m/2]! is relatively prime to n; therefore 
are mod n = 0 if and only if m! mod n = 0. Furthermore n # 4.) 


To compute m with a bounded number of registers, we can use Fibonacci numbers 
(see Algorithm 6.2.1F). Suppose we know that s = Fj, s! = Fj41,t = A, t = A? +1, 
u = (A +1), u = (A +1), v = A”, w = (A +1)”, (7) modn Æ 0, and 
Ce) mod n = 0. It is easy to reach this state of affairs with m = Fj+1, for suitably 
large j, in O(logn) steps; furthermore A will be larger than 2?°"+5), If s = 1, we set 
f(n) = ged(2m+1, n) or gcd(2m+2, n), whichever is 1, and terminate the algorithm. 
Otherwise we reduce j by 1 as follows: Set r + s, s +} s'— s, s! +} r,r +t, t+ |t'/t], 
t 4 r,r +} u, u+ |u'/u], u + r; then if ((wu/vt]| mod A) mod n Æ 0, set m + m+s, 
w 4+ wu, v + vt. 


[Can this problem be solved with fewer than O(log n) operations? Can the small- 
est, or the largest, prime factor of n be computed in O(log n) operations?] 


41. (a) Clearly $\pi(x) = \pi(m) + f_1(x,m) = \pi(m) + f(x,m) - f_0(x,m) - f_2(x,m) -
f_3(x,m) - \cdots$ when $1 \le m \le x$. Set $x = N^3$, $m = N$, and note that $f_k(N^3, N) = 0$ for
$k > 2$.

(b) We have $f_2(N^3, N) = \sum_{N<p\le q}[pq \le N^3] = \sum_{N<p\le N^{3/2}}(\pi(N^3/p) - \pi(p) + 1) =
\sum_{N<p\le N^{3/2}}\pi(N^3/p) - \binom{\pi(N^{3/2})}{2} + \binom{\pi(N)}{2}$, where $p$ and $q$ range over primes. Hence
$f_2(1000, 10) = \pi(\frac{1000}{11}) + \pi(\frac{1000}{13}) + \pi(\frac{1000}{17}) + \pi(\frac{1000}{19}) + \pi(\frac{1000}{23}) + \pi(\frac{1000}{29}) + \pi(\frac{1000}{31}) -
\binom{11}{2} + \binom{4}{2} = 24 + 21 + 16 + 15 + 14 + 11 + 11 - 55 + 6 = 63$.

(c) The hinted identity says simply that a $p_j$-survivor is a $p_{j-1}$-survivor that isn’t
a multiple of $p_j$. Clearly $f(N^3, N) = f(N^3, p_{\pi(N)})$. Apply the identity until reaching
terms $f(x, p_j)$ where either $j = 0$ or $x \le N^2$; the result is

$$f(N^3, N) = \sum_{k=1}^{N-1}\mu(k)\,f\Bigl(\frac{N^3}{k}, 1\Bigr) - \sum_{j=1}^{\pi(N)}\ \sum_{N/p_j\le k<N}\mu(k)\,f\Bigl(\frac{N^3}{kp_j}, p_{j-1}\Bigr)[k\text{ is a }p_j\text{-survivor}].$$

Now $f(x,1) = \lfloor x\rfloor$, so the first sum is $1000 - 500 - 333 - 200 + 166 - 142 = -9$
when $N = 10$. The second sum is $-f(\frac{1000}{10},1) - f(\frac{1000}{14},1) - f(\frac{1000}{15},2) - f(\frac{1000}{21},2) -
f(\frac{1000}{35},3) = -100 - 71 - 33 - 24 - 9 = -237$. Hence $f(1000,10) = -9 + 237 = 228$,
and $\pi(1000) = 4 + 228 - 1 - 63 = 168$.
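
The numerical values in parts (a) through (c) can be confirmed by brute force. The sketch below is only a check, not the sieve of part (d); the helper names are hypothetical, and $f_k(x, m)$ is taken to count the integers up to $x$ that have exactly $k$ prime factors, all exceeding $m$.

```python
from math import comb, isqrt


def least_prime_factor_table(limit):
    """lpf[n] = least prime factor of n, for 2 <= n <= limit."""
    lpf = list(range(limit + 1))
    for p in range(2, isqrt(limit) + 1):
        if lpf[p] == p:
            for multiple in range(p * p, limit + 1, p):
                if lpf[multiple] == multiple:
                    lpf[multiple] = p
    return lpf


def count_survivors(x, m, lpf, prime_factors=None):
    """Count n <= x whose prime factors all exceed m (optionally with a fixed number of them)."""
    total = 0
    for n in range(1, x + 1):
        count, rest, ok = 0, n, True
        while rest > 1:
            p = lpf[rest]
            if p <= m:
                ok = False
                break
            rest //= p
            count += 1
        if ok and (prime_factors is None or count == prime_factors):
            total += 1
    return total


def prime_count(x, lpf):
    return sum(1 for n in range(2, x + 1) if lpf[n] == n)


LPF = least_prime_factor_table(1000)
F = count_survivors(1000, 10, LPF)          # f(1000, 10)
F0 = count_survivors(1000, 10, LPF, 0)      # only the number 1
F2 = count_survivors(1000, 10, LPF, 2)      # products pq with 10 < p <= q
PI_1000 = prime_count(10, LPF) + F - F0 - F2

# The closed form of part (b), with N = 10 and N**(3/2) rounded down to 31.
F2_FORMULA = (sum(prime_count(1000 // p, LPF) for p in range(11, 32) if LPF[p] == p)
              - comb(prime_count(31, LPF), 2) + comb(prime_count(10, LPF), 2))
```

Here `F`, `F2`, and `PI_1000` correspond to the quantities 228, 63, and 168 derived above.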
(d) If N? < 2™ we can construct an array in which agm_i4, = [n + 1 is a pj- 


survivor] for 1 < n < N 2 represents a sieve after j passes, and an = Gan + Qan+1 
for 1 < n < 2™. Then it is easy to compute f(x,p;) in O(m) steps when z < N?, 
and to remove multiples of p from the sieve in O(N?m/p) steps. The total running 
time to compute f(N*,N) will come to O(N? log N log log N), because >D 1/p; = 
O(log log N). 

The storage requirement can be reduced from $2N^2m$ to $2Nm$ if we break the sieve
into $N$ parts of size $N$ and work on each part separately. Auxiliary tables of $p_j$ for
$1 \le j \le \pi(N)$, and of $\mu(k)$ and the least prime factor of $k$ for $1 \le k \le N$, are helpful
and easily constructed before the main computation begins.

[See Math. Comp. 44 (1985), 537-560. A similar procedure was first introduced
by D. F. E. Meissel, Math. Annalen 2 (1870), 636-642; 3 (1871), 523-525; 21 (1883),
304; 25 (1885), 251-257. D. H. Lehmer made several refinements in Illinois J. Math.
3 (1959), 381-388. Neither Meissel nor Lehmer had a stopping rule for the recurrence
that was as efficient as the method described above. Further refinements due to Marc
Deléglise, Joël Rivat, Xavier Gourdon, and Tomás Oliveira e Silva have made it possible
to deduce that $\pi(10^{23}) = 1925320391606803968923$; see Revista do DETUA 4 (2006),
759-768. Lagarias and Odlyzko also developed a completely different approach whereby
$\pi(N)$ can be evaluated in $O(N^{1/2+\epsilon})$ steps, using principles of analytic number theory;
see J. Algorithms 8 (1987), 173-191. But the constant in that O is impracticably large.]

42. L1. [Initialize] Find 7 such that rr = 1 (modulo s); then set r’ + nr mods, 

u + rrmods, v + s, w + (n—rr’)F/smods, 0 + |.V/N/s|, (u1, u3) — 
(1,u), (v1, v3) + (0, v). (We want to find all pairs of integers (A, p) such that 
(As + r)(us +7’) = N; this implies Au + u = w (modulo s) and Apu < 8. 
We will perform Algorithm 4.5.2X with t2, u2, v2 suppressed; the relations 


At3 + ptr = wti, Aug + pui = wui, Av3+ pvi = wri (modulo s) 


will remain invariant.) 

L2. [Try for divisors.] If vı = 0, output As +r whenever As + r divides N and 
0 < A < 0/s. If v3 = 0, output N/(us +r’) whenever us + r’ divides N 
and 0 < u < 6/s. Otherwise, for all k such that |wvı + ks| < 0 if vı < 0, 
or 0 < wv + ks < 20 if vı > 0, and for o = +1 and —1, output As +r if 
d= (wuis4 ks? +ugr4 vr’)? 4vıv3 N is a perfect square and if the numbers 


A= 


wus + ks? — vsr + vir’ +oVd wis + ks? + vsr — vir — ovd 
2U38 B 2u38 


are positive integers. (These are the solutions to Av3 + vı = wri + ks, 
(As +r)(us +r’) = N.) 

L3. [Done?] If $v_3 = 0$, the algorithm terminates.


L4. [Divide and subtract.] Set $q \leftarrow \lfloor u_3/v_3\rfloor$. If $u_3 = qv_3$ and $v_1 < 0$, decrease $q$
by 1. Then set

$(t_1, t_3) \leftarrow (u_1, u_3) - (v_1, v_3)q$, $(u_1, u_3) \leftarrow (v_1, v_3)$, $(v_1, v_3) \leftarrow (t_1, t_3)$

and return to step L2. ∎


[See Math. Comp. 42 (1984), 331-340. The bounds in step L2 can be sharpened, for 
example to ensure that d > 0. Some factors may be output more than once.] 


43. (a) First make sure that the Jacobi symbol $(\frac{y}{m})$ is $+1$. (If it’s 0, the task is easy;
if it’s $-1$, then $y \notin Q_m$.) Then choose random integers $x_1$, ..., $x_n$ in $[0\,..\,m)$ and
let $X_j = [G(y^2x_j^4 \bmod m) = (yx_j^2 \bmod m) \bmod 2]$. If $y \in Q_m$ we have $\mathrm{E}\,X_j \ge \frac12 + \epsilon$;
otherwise $m - y \in Q_m$ and $\mathrm{E}\,X_j \le \frac12 - \epsilon$. Report that $y \in Q_m$ if $X_1 + \cdots + X_n \ge \frac12 n$.
The probability of failure is at most $e^{-2\epsilon^2n}$, by exercise 1.2.10-21. Therefore we choose
$n = \lceil\frac12\epsilon^{-2}\ln\delta^{-1}\rceil$.

(b) Find an x with Jacobi symbol (2) = —1, and set y + z? mod m. Then the 
prime factors of m are gcd(x + \/y,m) and ged(x — vy, m), so our task is to find 
+,/y when y € Qm is given. If we can find 7v for any nonzero v, we are done, since 
vy = (v-'rv) mod m unless ged(v, m) is a factor of m. 

Assume that € = 2~° for some e > 1. Choose random integers a and b in [0..m), 
and assume that we know the binary fractions ao and o such that 


Tb 


e 


Bo| < 


Ta 
ao| < 


€ . 
64’ 4’ 

here ao is an odd multiple of €/64, while 6o is an odd multiple of e&/ 64. Assume also 
that we know àa and Ab. Of course we don’t really know ao, Bo, Aa, or Ab, but we will 
try all 327! x 327? x 2 x 2 possibilities. Spurious branches of the program, which 
operate under incorrect assumptions, will cause no harm. 

Define the numbers utj = 2™* (a+ (j+ 4)b) mod m and vj = 27™*7t (a+ jb) mod m. 
Both wz; and vij are uniformly distributed in [0..m), because a and b were chosen at 
random. Furthermore, for fixed t, the numbers utj for jo < j < jo +l are pairwise 
independent, and so are the numbers vw; for jo < j < jo +l, as long as l does not 
exceed the smallest prime factor of m. We will make use of uz; and vw; only for 
—2re7? < j < 2re~*: if any of these values has a nonzero factor in common with m, 
we’re done. 

For all v L m we define xv = +1 if v E Qm, xv = —1 if —v E€ Qm, and xv = 0 
if (+) = —1. Notice that yua+2); = vue, since wey = (2 ug+2)j) mod m. Therefore 
we can determine yur; and xvr; for all t and j by applying algorithm A to uz; and vi; 
for 0 < t < 1 and —2re~? < j < 2re~?. Setting ô = moer in that algorithm will 


1440 


ensure that all y values are correct with probability > 1 — aa 
The algorithm works in at most r stages. At the beginning of stage t, for0 < t <r, 


we assume that we know \27‘a, \2~*b, and fractions az, 8; such that 


72" 
m 


e 


2tF6' 


T2 *a € 


2t+6? 


at} < 


t| < 


m 


Define œr+ı = (ar +)27*a) and 641 = (8: +27*b); this preserves the inequalities. 
The next step is to find \2~*~'b, which satisfies 


T2 ta + j72-*b + r2 "bd 
m 


duty + AQ a + GA2*b + A274104 | | =0 (modulo 2). 
Let n = 4min(r, 2*) e7’; then when |j| < 3 we have 


72 *a 727*h | 727* 1b i € 
HJ t (ar + 7D: 4 Br41) < —. 
m m m 16 


Therefore if yur; = 1 it is likely that \2-'~'b = Gj, where Gj = (G(uz;y mod m) + 
2a + jA27*b + |atjbt + bt+1]) mod 2. More precisely, we will have 


[(72-*a + jr 2*b + 72-*~*b)/m| = lar + jbt + Beta] 


unless Tuj < 4m or Tusj > (1 — )m. Let Y; = (2G; — 1)xw;. If Yj = +1, it is 


16 


a vote for \2-'"'b = 1; if Yj = —1, it is a vote for \2~*'b = 0; if Y; = 0, it is an 
abstention. We will be democratic and set \27*~'b = Do ay Yz 0]. 


What is the probability that A2~*~'b is correct? Let Z; = —1 if xus # 0 and 
(Tuy < mor Tuj > (1—)m or G(uz;y mod m) Æ Auz;); otherwise let Zj = |xurj|. 
Since Z; is a function of uj, the random variables Z; are pairwise independent and 
have the same distribution. Let Z = DO Z;; if Z > 0, the value of \2~*~1b will 
be correct. The probability that Z; = 0 is 4, and the probability that Z; = +1 is 
> + +5 — 4) therefore EZ; > Be. Clearly var(Z,;) < L. So the chance of error, in 
the branch of the program that has the correct assumptions, is at most Pr(Z < 0) < 
Pr((Z — nE Zj} > n?) < nte? = 2 min(r,2*) 1, by Chebyshev’s inequality 
(exercise 3.5—42). 

A similar method, with vz; in place of ue;, can be used to determine dA2-* 1a with 
error < 2 min(r, 2‘)~*. Eventually we will have €*/2'*® < 1/(2m), so 72~‘b will be the 
nearest integer to mf:. Then we can compute yy = (2°b-1 72 ‘d) mod m; squaring 
this quantity will tell us if we are correct. 

The total change of making a mistake is bounded by #4 FSi t= =; in stages 
t < Ign, and by 4 5 Erat = $ in subsequent stages. So the total chance of error, 
including the possibility that the x values were not all correct, is at most 4 gta 4 +96 5 = =<. 
At least 5 of all runs of the program will succeed in finding yy; hence the factors ofm 
will be found after repeating the process at most 10 times, on the average. 

The total running time is dominated by O(re~* log(re~*) T(G)) for the x compu- 
tation, plus O(r?e~?T(G)) for subsequent guessing, plus O(r?e~°) for the calculations 
of at, Bt, A27*a, and \2~*Dd in all branches. 

This procedure, which nicely illustrates many of the basic paradigms of randomized 
algorithms, is due to R. Fischlin and C. P. Schnorr [J. Cryptology 13 (2000), 221- 
244], who derived it from earlier approaches by Alexi, Chor, Goldreich, and Schnorr 
[SICOMP 17 (1988), 194-209] and by Ben-Or, Chor, and Shamir [STOC 15 (1983), 
421-430]. When we combine it with Lemma 3.5P4, we get a theorem analogous to 
Theorem 3.5P, but with the sequence 3.2.2—(16) instead of 3.2.2-(17). Fischlin and 
Schnorr showed how to streamline the calculations so that their factoring algorithm 
takes O(re~* log(re~')T(G)) steps; the resulting time bound for “cracking” 3.2.2-(16) 
is T(F) = O(RN*e~* log(RNe—')(T(G) + R?)). The constant factor implied by this 
O is rather large, but not enormous. A similar method finds x from the RSA function 
y = x" mod m when a L (m), if we can guess y'/* mod 2 with probability > ste. 


44. Suppose Ses, aijz? = 0 (modulo mi), gcd(aio, ai1,...,@i(a—-1),Mi) = 1, and 
|x| < m; for 1 < i < k = d(d — 1)/2 + 1, where m; L mj for 1 < i<j <k. 
Also assume that m = min{m1,...,mx} > n”/227?/2dd, where n = d+ k. First find 
Ui, ..-, Ue Such that uj mod m; = 6,;. Then set up the n x n matrix 
M 
0 mM 
E 0 0 a m?-!M 
T | aoui Manur ... m? taa-iju M/mid 
azou2 Ma21U2 ... m*"ao¢a—1) U2 0 M/me2d 
Anouk MAakiUk ... mo ax(a—1) Uk 0 0 ... M/mrd 
where M = mimg2...m x; all entries above the diagonal are zero, hence det L = 
M” m" td". Now let v = (to,...,ta-1,V1,-..,Uk) be a nonzero integer vector 


with length(vL) < V/n27>M@-D/%mF-D/ng-k/™ Since M@-Y/™ < M/m*/", we 
have length(vL) < M/d. Let cj = t;M + Dh aijuivi and P(x) = co + az +- 4 
ca-127 1. Then P(x) = vi(ai + ans +4 Qita—1)2" *) = 0 (modulo mi), for 
1<i<-k; hence P(x) =0 (modulo M). Also |m’c;| < M/d; it follows that P(x) = 0. 
But P(x) is not identically zero, because the conditions viai; = 0 (modulo m;) and 
gcd(aio,.-.,@i(a—1), Mi) = 1 imply v; = 0 (modulo m;), while |v;M/mid| < M/d 
implies |v;| < mi; we cannot have vy = --- = vk = 0. Thus we can find x (more 
precisely, at most d — 1 possibilities for x), and the total running time is polynomial 
in lg M. [Lecture Notes in Comp. Sci. 218 (1985), 403-408. ] 


45. Fact 1. A solution always exists. Suppose first that $n$ is prime. If $(\frac{b}{n}) = 1$,
there is a solution with $y = 0$. If $(\frac{b}{n}) = -1$, let $j > 0$ be minimum such that we have
$(\frac{-ja}{n}) = -1$; then $x_0^2 - a \equiv -ja$ and $b \equiv -ja(y_0)^2$ for some $x_0$ and $y_0$ (modulo $n$),
hence $(x_0y_0)^2 - ay_0^2 \equiv b$. Suppose next that we have found a solution $x^2 - ay^2 \equiv b$
(modulo $n$) and we want to extend this to a solution modulo $n^2$. We can always find $c$
and $d$ such that $(x+cn)^2 - a(y+dn)^2 \equiv b$ (modulo $n^2$), because $(x+cn)^2 - a(y+dn)^2 \equiv
x^2 - ay^2 + (2cx - 2ayd)n$ and $\gcd(2x, 2ay) \perp n$. Thus a solution always exists when
$n$ is a power of an odd prime. (We need to assume that $n$ is odd because, for example,
there is no solution to $x^2 + y^2 \equiv 3$ (modulo 8).) Finally, a solution exists for all odd $n$,
by the Chinese remainder theorem.

Fact 2. The number of solutions, given $a$ and $n$ with $a \perp n$, is the same for
all $b \perp n$. This follows from the hinted identity and Fact 1, for if $x_1^2 - ay_1^2 \equiv b$ then
$(x_1x_2 - ay_1y_2,\ x_1y_2 + x_2y_1)$ runs through all solutions of $x^2 - ay^2 \equiv b$ as $(x_2, y_2)$ runs
through all solutions of $x^2 - ay^2 \equiv 1$. In other words, $(x_2, y_2)$ is uniquely determined
by $(x_1, y_1)$ and $(x, y)$, when $x_1^2 - ay_1^2 \perp n$.

Fact 3. Given integers $(a, s, z)$ such that $z^2 \equiv a$ (modulo $s$), we can find integers
$(x, y, m, t)$ with $x^2 - ay^2 = m^2st$, where $(x, y) \ne (0, 0)$ and $t^2 \le \frac43|a|$. For if $z^2 = a + ms$,
let $(u, v)$ be a nonzero pair of integers that minimizes $(zu + mv)^2 + |a|u^2$. We can find
$(u, v)$ efficiently using the methods of Section 3.3.4, and $(zu + mv)^2 + |a|u^2 \le (\frac43|a|m^2)^{1/2}$
by exercise 3.3.4-9. Therefore $(zu + mv)^2 - au^2 = mt$ where $t^2 \le \frac43|a|$. The hinted
identity now solves $x^2 - ay^2 = (ms)(mt)$.

Fact 4. It is easy to solve $x^2 - y^2 \equiv b$ (modulo $n$): Let $x = (b+1)/2$, $y = (b-1)/2$.

Fact 5. It is not difficult to solve $x^2 + y^2 \equiv b$ (modulo $n$), because the method
in exercise 3.3.4-11 solves $x^2 + y^2 = p$ when $p$ is prime and $p \bmod 4 = 1$; one of the
numbers $b$, $b + n$, $b + 2n$, ... will be such a prime.
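
Facts 2 and 4 are easy to test exhaustively for a small odd modulus. The following Python sketch is an illustration only; the function names are hypothetical, and the division by 2 in Fact 4 is interpreted as multiplication by the inverse of 2 modulo $n$.

```python
from math import gcd


def count_solutions(a, b, n):
    """Number of pairs (x, y) mod n with x*x - a*y*y = b (mod n), by brute force."""
    return sum((x * x - a * y * y - b) % n == 0
               for x in range(n) for y in range(n))


def solve_difference_of_squares(b, n):
    """Fact 4: x = (b+1)/2 and y = (b-1)/2, computed modulo the odd number n."""
    half = pow(2, -1, n)
    return (b + 1) * half % n, (b - 1) * half % n


def solution_counts(a, n):
    """Set of solution counts over all b prime to n; Fact 2 says it has one element."""
    return {count_solutions(a, b, n) for b in range(n) if gcd(b, n) == 1}
```

For example, `solution_counts(7, 15)` collects the counts for the eight values of $b$ that are prime to 15; by Fact 2 they should all coincide.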


Now to solve the stated problem when $|a| > 1$ we can proceed as follows. Choose
$u$ and $v$ at random between 1 and $n - 1$, then compute $w = (u^2 - av^2) \bmod n$ and
$d = \gcd(w, n)$. If $1 < d < n$ or if $\gcd(v, n) > 1$ we can reduce $n$; the methods used
to prove Fact 1 will lift solutions for factors of $n$ to solutions for $n$ itself. If $d = n$
and $v \perp n$, we have $(u/v)^2 \equiv a$ (modulo $n$), hence we can reduce $a$ to 1. Otherwise
$d = 1$; let $s = bw \bmod n$. This number $s$ is uniformly distributed among the numbers
prime to $n$, by Fact 2. If $(\frac{a}{s}) = 1$, try to solve $z^2 \equiv a$ (modulo $s$), assuming that $s$ is
prime (exercise 4.6.2-15). If unsuccessful, start over with another random choice of $u$
and $v$. If successful, let $z^2 = a + ms$ and compute $d = \gcd(ms, n)$. If $d > 1$, reduce
the problem as before. Otherwise use Fact 3 to find $x^2 - ay^2 = m^2st$ with $t^2 \le \frac43|a|$;
this makes $(x/m)^2 - a(y/m)^2 \equiv st$ (modulo $n$). If $t = 0$, reduce $a$ to 1. Otherwise
apply the algorithm recursively to solve $X^2 - tY^2 \equiv a$ (modulo $n$). (Since $t$ is much
smaller than $a$, only $O(\log\log n)$ levels of recursion will be necessary.) If $\gcd(Y, n) > 1$
we can reduce $n$ or $a$; otherwise $(X/Y)^2 - a(1/Y)^2 \equiv t$ (modulo $n$). Finally the hinted
identity yields a solution to $x'^2 - ay'^2 \equiv s$ (see Fact 2), which leads in turn to the
desired solution because $u^2 - av^2 \equiv s/b$.

In practice only O(log n) random trials are needed before the assumptions about 
prime numbers made in this algorithm turn out to be true. But a formal proof would re- 
quire us to assume the Extended Riemann Hypothesis [IEEE Trans. IT-33 (1987), 702- 
709]. Adleman, Estes, and McCurley [Math. Comp. 48 (1987), 17-28] have developed a 
slower and more complicated algorithm that does not rely on any unproved hypotheses. 
46. [FOCS 20 (1979), 55-60.] After finding a™* mod p = J]; Dp," for enough ni, 
we can solve J); Zijkeij + (p — 1)tjx = 5j% in integers Tijk, tjk for 1 < j,k < m (for 
example, as in 4.5.2-(23)), thereby knowing the solutions Nj = (7, £ijkejk) mod (p—1) 
to aNi mod p = pj. Then if ba” modp = M- pei, we have n +n = Dga eNi 
(modulo p — 1). [Improved algorithms are known; see, for example, Coppersmith, 
Odlyzko, and Schroeppel, Algorithmica 1 (1986), 1-15.] 

47. Earlier printings of this book had a 211-digit N, which was cracked in 2012 using 


the elliptic curve method and the general number field method by Greg Childers and 
about 500 volunteers(!). 


SECTION 4.6 

1. $9x^2 + 7x + 7$; $5x^3 + 7x^2 + 2x + 6$.

2. (a) True. (b) False if the algebraic system $S$ contains zero divisors, that is, nonzero
numbers whose product is zero, as in exercise 1; otherwise true. (c) True when $m \ne n$,
but false in general when $m = n$, since the leading coefficients might cancel.

3. Assume that $r \le s$. For $0 \le k \le r$ the maximum is $m_1m_2(k+1)$; for $r \le k \le s$ it
is $m_1m_2(r+1)$; for $s \le k \le r+s$ it is $m_1m_2(r+s+1-k)$. The least upper bound
valid for all $k$ is $m_1m_2(r+1)$. (The solver of this exercise will know how to factor the
polynomial $x^7 + 2x^6 + 3x^5 + 3x^4 + 3x^3 + 3x^2 + 2x + 1$.)


4. If one of the polynomials has fewer than $2^t$ nonzero coefficients, the product can be
formed by putting exactly $t - 1$ zeros between each of the coefficients, then multiplying
in the binary number system, and finally using a bitwise AND instruction (present on
most binary computers, see Algorithm 4.5.4D) to zero out the extra bits. For example,
if $t = 3$, the multiplication in the text would become $(1001000001)_2 \times (1000001001)_2 =
(1001001011001001001)_2$; the desired answer is obtained if we AND this result with the
constant $(1001001\ldots1001)_2$. A similar technique can be used to multiply polynomials
with nonnegative coefficients that are not too large.
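
The spacing trick can be reproduced directly with Python integers used as bit strings. This is a minimal sketch under the assumption that bit $i$ of an integer holds the coefficient of $x^i$; the names `spread`, `gather`, `polymul_mod2`, and `polymul_mod2_reference` are hypothetical.

```python
def spread(poly, t):
    """Insert t-1 zero bits between consecutive bits of poly."""
    out, i = 0, 0
    while poly:
        out |= (poly & 1) << (i * t)
        poly >>= 1
        i += 1
    return out


def gather(word, t):
    """Inverse of spread: keep the bits in positions 0, t, 2t, ..."""
    out, i = 0, 0
    while word:
        out |= (word & 1) << i
        word >>= t
        i += 1
    return out


def polymul_mod2(u, v, t):
    """Multiply polynomials mod 2, assuming one has fewer than 2**t nonzero coefficients."""
    product = spread(u, t) * spread(v, t)        # ordinary binary multiplication
    mask = spread((1 << (u.bit_length() + v.bit_length())) - 1, t)  # (1001...1001)_2
    return gather(product & mask, t)             # AND away the carry bits


def polymul_mod2_reference(u, v):
    """Schoolbook shift-and-XOR multiplication, used only for comparison."""
    result = 0
    while v:
        if v & 1:
            result ^= u
        u <<= 1
        v >>= 1
    return result
```

With $t = 3$, `spread(0b1101, 3)` and `spread(0b1011, 3)` are the two binary numbers displayed above, and `polymul_mod2(0b1101, 0b1011, 3)` yields `0b1111111`.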

5. Polynomials of degree $\le 2n$ can be written $U_1(x)x^n + U_0(x)$ where $\deg(U_1) \le n$
and $\deg(U_0) \le n$; and $(U_1(x)x^n + U_0(x))(V_1(x)x^n + V_0(x)) = U_1(x)V_1(x)(x^{2n} + x^n) +
(U_1(x) + U_0(x))(V_1(x) + V_0(x))x^n + U_0(x)V_0(x)(x^n + 1)$. (This equation assumes that
arithmetic is being done modulo 2.) Thus Eqs. 4.3.3-(3) and 4.3.3-(5) hold.
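
A recursive multiplication routine built on this identity might look as follows in Python. It is a sketch, not the book's algorithm: polynomials mod 2 are stored as integers (bit $i$ is the coefficient of $x^i$), XOR plays the role of addition, and the names `clmul` and `karatsuba_mod2` together with the cutoff of 16 are assumptions for the example.

```python
def clmul(u, v):
    """Schoolbook product of two polynomials mod 2 (shift and XOR)."""
    result = 0
    while v:
        if v & 1:
            result ^= u
        u <<= 1
        v >>= 1
    return result


def karatsuba_mod2(u, v):
    """Product mod 2 using three half-size products, as in the identity above."""
    if u < 16 or v < 16:
        return clmul(u, v)
    n = max(u.bit_length(), v.bit_length()) // 2
    mask = (1 << n) - 1
    u1, u0 = u >> n, u & mask        # u = U1*x**n + U0
    v1, v0 = v >> n, v & mask        # v = V1*x**n + V0
    high = karatsuba_mod2(u1, v1)
    low = karatsuba_mod2(u0, v0)
    middle = karatsuba_mod2(u1 ^ u0, v1 ^ v0)
    # U1V1(x**2n + x**n) + (U1+U0)(V1+V0)x**n + U0V0(x**n + 1), all mod 2
    return (high << (2 * n)) ^ (high << n) ^ (middle << n) ^ (low << n) ^ low
```

The result of `karatsuba_mod2(u, v)` should agree with `clmul(u, v)` for all nonnegative integers `u` and `v`.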

Notes: S. A. Cook has shown that Algorithm 4.3.3T can be extended in a similar
way; and A. Schönhage [Acta Informatica 7 (1977), 395-398] has explained how to
multiply polynomials mod 2 with only $O(n\log n\log\log n)$ bit operations. In fact,
polynomials over any ring $S$ can be multiplied with only $O(n\log n\log\log n)$ algebraic
operations, even when S is an algebraic system in which multiplication need not be 
commutative or associative [D. G. Cantor and E. Kaltofen, Acta Informatica 28 (1991), 
693-701]. See also exercises 4.6.4-57 and 4.6.4-58. But these ideas are not useful for 
sparse polynomials (having mostly zero coefficients). 


SECTION 4.6.1 
1. $q(x) = 1\cdot2^3x^3 + 0\cdot2^2x^2 - 2\cdot2x + 8 = 8x^3 - 4x + 8$; $r(x) = 28x^2 + 4x + 8$.


2. The monic sequence of polynomials produced during Euclid’s algorithm has the
coefficients (1,5,6,6,1,6,3), (1,2,5,2,2,4,5), (1,5,6,2,3,4), (1,3,4,6), 0. Hence the
greatest common divisor is $x^3 + 3x^2 + 4x + 6$. (The greatest common divisor of a
polynomial and its reverse is always symmetric, in the sense that it is a unit multiple
of its own reverse.)


3. The procedure of Algorithm 4.5.2X is valid, with polynomials over $S$ substituted
for integers. When the algorithm terminates, we have $U(x) = u_2(x)$, $V(x) = u_1(x)$. Let
$m = \deg(u)$, $n = \deg(v)$. It is easy to prove by induction that $\deg(u_3) + \deg(v_1) = n$,
$\deg(u_3) + \deg(v_2) = m$, after step X3, throughout the execution of the algorithm,
provided that $m \ge n$. Hence if $m$ and $n$ are greater than $d = \deg(\gcd(u, v))$ we have
$\deg(U) < m - d$, $\deg(V) < n - d$; the exact degrees are $m - d_1$ and $n - d_1$, where $d_1$
is the degree of the next-to-last nonzero remainder. If $d = \min(m, n)$, say $d = n$, we
have $U(x) = 0$ and $V(x) = 1$.

When $u(x) = x^m - 1$ and $v(x) = x^n - 1$, the identity $(x^m - 1) \bmod (x^n - 1) =
x^{m \bmod n} - 1$ shows that all polynomials occurring during the calculation are monic,
with integer coefficients. When $u(x) = x^{21} - 1$ and $v(x) = x^{13} - 1$, we have $V(x) =
x^{11} + x^8 + x^6 + x^3 + 1$ and $U(x) = -(x^{19} + x^{16} + x^{14} + x^{11} + x^8 + x^6 + x^3 + x)$. [See also
Eq. 3.3.3-(29), which gives an alternative formula for $U(x)$ and $V(x)$. See also exercise
4.3.2-6, with 2 replaced by $x$.]
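
The example with $x^{21} - 1$ and $x^{13} - 1$ can be reproduced by a small extended Euclidean routine on integer coefficient lists. This Python sketch is an illustration only: polynomials are lists with the constant term first, the helper names are hypothetical, and the division routine assumes every divisor has leading coefficient $\pm1$, which holds in this example.

```python
def trim(p):
    """Remove trailing zero coefficients in place and return the list."""
    while p and p[-1] == 0:
        p.pop()
    return p


def poly_sub(a, b):
    size = max(len(a), len(b))
    return trim([(a[i] if i < len(a) else 0) - (b[i] if i < len(b) else 0)
                 for i in range(size)])


def poly_mul(a, b):
    if not a or not b:
        return []
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return trim(out)


def divmod_unit_leading(a, b):
    """Quotient and remainder of a by b, where b has leading coefficient +1 or -1."""
    a = a[:]
    q = [0] * max(len(a) - len(b) + 1, 0)
    while len(a) >= len(b):
        c = a[-1] * b[-1]            # dividing by +-1 equals multiplying by it
        shift = len(a) - len(b)
        q[shift] = c
        for i, y in enumerate(b):
            a[shift + i] -= c * y
        trim(a)
    return trim(q), a


def extended_gcd(u, v):
    """Return (u1, u2, u3) with u1*u + u2*v = u3, the last nonzero remainder."""
    u1, u2, u3 = [1], [], u
    v1, v2, v3 = [], [1], v
    while v3:
        q, r = divmod_unit_leading(u3, v3)
        u1, v1 = v1, poly_sub(u1, poly_mul(q, v1))
        u2, v2 = v2, poly_sub(u2, poly_mul(q, v2))
        u3, v3 = v3, r
    return u1, u2, u3


def x_power_minus_one(m):
    return [-1] + [0] * (m - 1) + [1]
```

Calling `extended_gcd(x_power_minus_one(21), x_power_minus_one(13))` returns $V(x)$, $U(x)$, and the gcd $x - 1$, in that order.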


4. Since the quotient $q(x)$ depends only on $v(x)$ and the first $m - n$ coefficients of $u(x)$,
the remainder $r(x) = u(x) - q(x)v(x)$ is uniformly distributed and independent of $v(x)$.
Hence each step of the algorithm may be regarded as independent of the others; this
algorithm is much more well-behaved than Euclid’s algorithm over the integers.

The probability that $n_1 = n - k$ is $p^{1-k}(1 - 1/p)$, and $t = 0$ with probability $p^{-n}$.
Each succeeding step has essentially the same behavior; hence we can see that any given
sequence of degrees $n$, $n_1$, ..., $n_t$, $-\infty$ occurs with probability $(p-1)^t/p^n$. To find
the average value of $f(n_1, \ldots, n_t)$, let $S_t$ be the sum of $f(n_1, \ldots, n_t)$ over all sequences
$n > n_1 > \cdots > n_t \ge 0$ having a given value of $t$; then the average is $\sum_t S_t(p-1)^t/p^n$.

Let $f(n_1, \ldots, n_t) = t$; then $S_t = \binom{n}{t}t$, so the average is $n(1 - 1/p)$. Similarly, if
$f(n_1, \ldots, n_t) = n_1 + \cdots + n_t$, then $S_t = \binom{n}{2}\binom{n-1}{t-1}$ and the average is $\binom{n}{2}(1 - 1/p)$.
Finally, if $f(n_1, \ldots, n_t) = (n - n_1)n_1 + \cdots + (n_{t-1} - n_t)n_t$, then


S= (The) — m+ D) + (2) O 


and the average is (35) (n+ 1)p/(p— 1) + (p/(p 1)’ 1/p”**). 




(The probability that n_{j+1} = n_j − 1 for 1 ≤ j < t = n is (1 − 1/p)^n, obtained
by setting S_t = [t = n]; so this probability approaches 1 as p → ∞. As a consequence
we have further evidence for the text’s claim that Algorithm C almost always finds
δ_2 = δ_3 = ··· = 1, because any polynomials that fail the latter condition will fail the
former condition modulo p for all p.)


5. Using the formulas developed in exercise 4, with f(n_1, ..., n_t) = [n_t = 0], we find
that the probability is 1 − 1/p if n > 0, 1 if n = 0.

6. Assuming that the constant terms u(0) and v(0) are nonzero, imagine a “right- 
to-left” division algorithm, u(x) = v(x)q(x) +2” r(x), where deg(r) < deg(v). We 
obtain a gcd algorithm analogous to Algorithm 4.5.2B, which is essentially Euclid’s 
algorithm applied to the “reverse” of the original inputs (see exercise 2), afterwards 
reversing the answer and multiplying by an appropriate power of x. 

There is a similar algorithm analogous to the method of exercise 4.5.2-40. The 
average number of iterations for both algorithms has been found by G. H. Norton, 
SICOMP 18 (1989), 608-624; K. Ma and J. von zur Gathen, J. Symbolic Comp. 9 
(1990), 429-455. 


7. The units of S (as polynomials of degree zero). 


8. If u(x) = v(x)w(x), where u(x) has integer coefficients while v(x) and w(x) have
rational coefficients, there are nonzero integers m and n such that m·v(x) and n·w(x)
have integer coefficients. Now u(x) is primitive, so Eq. (4) implies that

    u(x) = pp((m·v(x))(n·w(x))) = ±pp(m·v(x)) pp(n·w(x)).

9. We can extend Algorithm E as follows: Let (u_1(x), u_2(x), u_3, u_4(x)) and (v_1(x),
v_2(x), v_3, v_4(x)) be quadruples that satisfy the relations u_1(x)u(x) + u_2(x)v(x) = u_3 u_4(x)
and v_1(x)u(x) + v_2(x)v(x) = v_3 v_4(x). The extended algorithm starts with the quadruples
(1, 0, cont(u), pp(u(x))) and (0, 1, cont(v), pp(v(x))) and manipulates them in such
a way as to preserve the conditions above, where u_4(x) and v_4(x) run through the
same sequence as u(x) and v(x) do in Algorithm E. If a u_4(x) = q(x)v_4(x) + b r(x),
we have a v_3 (u_1(x), u_2(x)) − q(x) u_3 (v_1(x), v_2(x)) = (r_1(x), r_2(x)), where r_1(x)u(x) +
r_2(x)v(x) = b u_3 v_3 r(x), so the extended algorithm can preserve the desired relations.
If u(x) and v(x) are relatively prime, the extended algorithm eventually finds r(x) of
degree zero, and we obtain U(x) = r_2(x), V(x) = r_1(x) as desired. (In practice we
would divide r_1(x), r_2(x), and b u_3 v_3 by gcd(cont(r_1), cont(r_2)).) Conversely, if such
U(x) and V(x) exist, then u(x) and v(x) have no common prime divisors, since they
are primitive and have no common divisors of positive degree.


10. By successively factoring reducible polynomials into polynomials of smaller de- 
gree, we must obtain a finite factorization of any polynomial into irreducibles. The 
factorization of the content is unique. To show that there is at most one factorization 
of the primitive part, the key result is to prove that if u(x) is an irreducible factor of 
v(x) w(x), but not a unit multiple of the irreducible polynomial v(x), then u(x) is a 
factor of w(x). This can be proved by observing that u(x) is a factor of v(x)w(x)U(x) =
r w(x) − w(x)u(x)V(x) by the result of exercise 9, where r is a nonzero constant.


11. The only row names needed would be A_1, A_0, B_4, B_3, B_2, B_1, B_0, C_1, C_0, D_0. In
general, let u_{j+2}(x) = 0; then the rows needed for the proof are A_{n_2−n_j} through A_0,
B_{n_1−n_j} through B_0, C_{n_2−n_j} through C_0, D_{n_3−n_j} through D_0, etc.


12. If nk = 0, the text’s proof of (24) shows that the value of the determinant is thx, 


and this equals EL Placer oo ~) Ifthe polynomials have a factor of positive 
degree, we can artificially assume that the polynomial zero has degree zero and use the
same formula with ℓ_k = 0.

Notes: The value R(u, v) of Sylvester’s determinant is called the resultant of u
and v, and the quantity (−1)^{deg(u)(deg(u)−1)/2} ℓ(u)^{−1} R(u, u′) is called the discriminant
of u, where u′ is the derivative of u. If u(x) has the factored form a(x − α_1) ... (x − α_m),
and if v(x) = b(x − β_1) ... (x − β_n), the resultant R(u, v) is a^n v(α_1) ... v(α_m) =
(−1)^{mn} b^m u(β_1) ... u(β_n) = a^n b^m Π_{i=1}^{m} Π_{j=1}^{n} (α_i − β_j). It follows that the polynomials
of degree mn in y defined as the respective resultants with v(x) of u(y − x), u(y + x),
x^m u(y/x), and u(yx) have as respective roots the sums α_i + β_j, differences α_i − β_j,
products α_i β_j, and quotients α_i/β_j (when v(0) ≠ 0). This idea has been used by
R. G. K. Loos to construct algorithms for arithmetic on algebraic numbers [Computing,
Supplement 4 (1982), 173-187].

If we replace each row A; in Sylvester’s matrix by 


(bo Ai + br Asta + +++ + Ong—1-iAny—1) — (@oBi + a1 Bi+1 + +++ + Gny-1-iBny-1), 


and then delete rows Bn,—1ı through Bo and the last n2 columns, we obtain an nı x nı 
determinant for the resultant instead of the original (ni + n2) x (nı + n2) determinant. 
In some cases the resultant can be evaluated efficiently by means of this determinant; 
see CACM 12 (1969), 23-30, 302-303. 

J. T. Schwartz has shown that it is possible to evaluate resultants and Sturm
sequences for polynomials of degree n with a total of O(n(log n)^2) arithmetic operations
as n → ∞. [See JACM 27 (1980), 701-717.]


13. One can show by induction on j that the values of (uj+1(x), gj+1, hj) are replaced 
respectively by (T? w(a)u;(a), tP gj, €?ih;) for j > 2, where pj = ni + no — 2nj. 
[In spite of this growth, the bound (26) remains valid.] 

14. Let p be a prime of the domain, and let j,k be maximum such that p*\v, = &(v), 


p)\Un-1- Let P = p". By Algorithm R we may write q(x) = ao + Paix +--+ P*easx’*, 
where s = m—n > 2. Let us look at the coefficients of x"t', x”, and x"~' in v(x) q(a), 
namely Paiun + P?azvun-i +- , G0Un + Paivn—-1 +--+, and aoUvn—1 + Paivn—2+:°-, 


each of which is a multiple of P?. We conclude from the first that p’ \a1, from the 
second that p™™?9)\ao, then from the third that P\ao. Hence P\r(x). [If m were 
only n+ 1, the best we could prove would be that plh/ 2] divides r(x); for example, 
consider u(x) = z? +1, v(x) = 4a? + 22 +1, r(x) = 18. On the other hand, an 
argument based on determinants of matrices like (21) and (22) can be used to show 
that £(r)s8)—4e(")-1 (2) is always a multiple of £(v) 408“) — des) (deg(v) —deg(r)—1) ] 

15. Let c_{ij} = a_{i1}a_{j1} + ··· + a_{in}a_{jn}; we may assume that c_{ii} > 0 for all i. If c_{ij} ≠ 0 for
some i ≠ j, we can replace row i and column i by (c_{i1} − t c_{j1}, ..., c_{in} − t c_{jn}), where
t = c_{ij}/c_{jj}; this does not change the value of det C, and it decreases the value of the
upper bound we wish to prove, since c_{ii} is replaced by c_{ii} − c_{ij}^2/c_{jj}. Such replacements
can be done in a systematic way for increasing i and for j < i, until c_{ij} = 0 for all
i ≠ j. [The latter algorithm is called the Gram-Schmidt orthogonalization process: See
Crelle 94 (1883), 41-73; Math. Annalen 63 (1907), 442.] Then det(A)^2 = det(AA^T) =
c_{11} ... c_{nn}.


16. A univariate polynomial of degree d over any unique factorization domain has at
most d roots (see exercise 3.2.1.2-16(b)); so if n = 1 it is clear that |r(S_1)| ≤ d_1. If n > 1
we have f(x_1, ..., x_n) = g_0(x_2, ..., x_n) + x_1 g_1(x_2, ..., x_n) + ··· + x_1^{d_1} g_{d_1}(x_2, ..., x_n)
where g_k is nonzero for at least one k. Given (x_2, ..., x_n), it follows that f(x_1, ..., x_n)
is zero for at most d_1 choices of x_1, unless g_k(x_2, ..., x_n) = 0; hence |r(S_1, ..., S_n)| ≤
and R. J. Lipton, Inf. Proc. Letters 7 (1978), 193-195.]

Notes: The stated upper bound is best possible, because equality occurs for the
polynomial f(x_1, ..., x_n) = Π{x_j − s_k | s_k ∈ S_j, 1 ≤ k ≤ d_j, 1 ≤ j ≤ n}. But
there is another sense in which the upper bound can be significantly improved: Let
f_1(x_1, ..., x_n) = f(x_1, ..., x_n), and let f_{j+1}(x_{j+1}, ..., x_n) be any nonzero coefficient of
a power of x_j in f_j(x_j, ..., x_n). Then we can let d_j be the degree of x_j in f_j instead
of the (often much larger) degree of x_j in f. For example, we could let d_1 = 3 and
d_2 = 1 in the polynomial x_1^3 x_2^9 − 3x_1^2 x_2 + x_2^100 + 5. This observation ensures that
d_1 + ··· + d_n ≤ d when each term of f has total degree ≤ d; hence the probability in
such cases is

    |r(S, ..., S)| / |S|^n ≤ 1 − (1 − d_1/|S|) ... (1 − d_n/|S|) ≤ (d_1 + ··· + d_n)/|S| ≤ d/|S|

when all sets S_j are equal. If this probability is ≤ 1/2, and if f(x_1, ..., x_n) turns out to
be zero for 50 randomly selected vectors (x_1, ..., x_n), then f(x_1, ..., x_n) is identically
zero with probability at least 1 − 2^{−50}.

Moreover, if f_j(x_j, ..., x_n) has the special form x_j^{e_j} f_{j+1}(x_{j+1}, ..., x_n) with e_j > 0
we can take d_j = 1, because x_j must then be 0 when f_{j+1}(x_{j+1}, ..., x_n) ≠ 0. A sparse
polynomial with only m nonzero terms will therefore have d_j ≤ 1 for at least n − lg m
values of j.
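
The randomized zero test described in these notes is easy to sketch. The function name `probably_zero`, its parameters, and the two sample polynomials below are illustrative assumptions; the sample set is chosen as {1, ..., 100} so that the total degree 2 is at most half of |S|, as the note requires.

```python
import random

def probably_zero(f, n, sample_set, trials=50, rng=None):
    """Randomized test of whether the n-variable polynomial function f is identically zero.

    Returns False as soon as a nonzero value is seen (a certain answer);
    returns True if all `trials` random evaluations give zero.
    """
    rng = rng or random.Random(0)        # fixed seed, only to make the sketch repeatable
    for _ in range(trials):
        point = [rng.choice(sample_set) for _ in range(n)]
        if f(*point) != 0:
            return False
    return True


S = list(range(1, 101))

def identity(x, y):
    # (x + y)^2 - x^2 - 2xy - y^2, which is identically zero
    return (x + y) ** 2 - x * x - 2 * x * y - y * y

def not_identity(x, y):
    # (x + y)^2 - x^2 - y^2 = 2xy, which is not identically zero
    return (x + y) ** 2 - x * x - y * y

print(probably_zero(identity, 2, S))
print(probably_zero(not_identity, 2, S))
```

A result of True is only probabilistic evidence, with error probability bounded as in the note; a result of False is a proof that f is not identically zero.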

Applications of this inequality to gcd calculation and other operations on sparse
multivariate polynomials were introduced by R. Zippel, Lecture Notes in Comp. Sci. 72
(1979), 216-226. J. T. Schwartz [JACM 27 (1980), 701-717] gave further extensions,
including a way to avoid large numbers by means of modular arithmetic: If the
coefficients of f are integers, if P is a set of prime numbers all ≥ q, and if |f(x_1, ..., x_n)| ≤ L
whenever each x_j ∈ S_j, then the number of solutions to f(x_1, ..., x_n) ≡ 0 (modulo p)
for p ∈ P is at most

    |S_1| ... |S_n| |P| − (|S_1| − d_1) ... (|S_n| − d_n)(|P| − log_q L).


17. (a) For convenience, let us describe the algorithm only for A = {a, b}. The
hypotheses imply that deg(Q_1 U) = deg(Q_2 V) ≥ 0, deg(Q_1) ≤ deg(Q_2). If deg(Q_1) = 0,
then Q_1 is just a nonzero rational number, so we set Q = Q_2/Q_1. Otherwise we let
Q_1 = aQ_11 + bQ_12 + r_1, Q_2 = aQ_21 + bQ_22 + r_2, where r_1 and r_2 are rational numbers;
it follows that

    Q_1 U − Q_2 V = a(Q_11 U − Q_21 V) + b(Q_12 U − Q_22 V) + r_1 U − r_2 V.

We must have either deg(Q_11) = deg(Q_1) − 1 or deg(Q_12) = deg(Q_1) − 1. In the former
case, deg(Q_11 U − Q_21 V) < deg(Q_11 U), by considering the terms of highest degree that
start with a; so we may replace Q_1 by Q_11, Q_2 by Q_21, and repeat the process. Similarly
in the latter case, we may replace (Q_1, Q_2) by (Q_12, Q_22) and repeat the process.

(b) We may assume that deg(U) ≥ deg(V). If deg(R) ≥ deg(V), note that Q_1 U −
Q_2 V = Q_1 R − (Q_2 − Q_1 Q)V has degree less than deg(V) ≤ deg(Q_1 R), so we can repeat
the process with U replaced by R; we obtain R = Q′V + R′, U = (Q + Q′)V + R′,
where deg(R′) < deg(R), so eventually a solution will be obtained.

(c) The algorithm of (b) gives V_1 = UV_2 + R, deg(R) < deg(V_2); by homogeneity,
R = 0 and U is homogeneous.

(d) We may assume that deg(V) ≤ deg(U). If deg(V) = 0, set W ← U; otherwise
use (c) to find U = QV, so that QVV = VQV, (QV − VQ)V = 0. This implies that
QV = VQ, so we can set U ← V, V ← Q and repeat the process.

For further details about the subject of this exercise, see P. M. Cohn, Proc. 
Cambridge Phil. Soc. 57 (1961), 18-30. The considerably more difficult problem of 
characterizing all string polynomials such that UV = VU has been solved by G. M. 
Bergman [Ph.D. thesis, Harvard University, 1967]. 


18. [P. M. Cohn, Transactions of the Amer. Math. Soc. 109 (1963), 332-356.] 


S1. Set u, © Uj, u2 + Uz, v & Vi, v & Va, z — z3 wr + wh & 1, 
zi & ze wi } wz +} 0, n +0. 


S2. (At this point the identities given in the exercise hold, and uivi = u2v2; 
v2 = 0 if and only if uı = 0.) If v2 = 0, the algorithm terminates with 
gcrd(V1, V2) = v1, Ielm(Vi, V2) = z1 V1 = —24V2. (Also, by symmetry, we have 
gcld(U1, U2) = u2 and lerm(U1, U2) = Uiwi = —U2w2.) 


S3. Find Q and R such that vı = Quz + R, where deg(R) < deg(v2). (We have 

ui(Qve2 + R) = U2U2, SO uR = (u2 — uiQ)v2 = R'v2.) 

S4. Set (wi, w2, wi, W3, 21, 22, Zi, 24, U1, U2, V1, V2) — (wi — W1Q, ws — weQ, 

w1, W2, 21, 2, 21 — Q21, 22 — Qz4, u2 — U1Q, wi, v2, v1 — Qu2) and n + n+1. 
Go back to S2. I 

This extension of Euclid’s algorithm includes most of the features we have seen 
in previous extensions, all at the same time, so it provides new insight into the special 
cases already considered. To prove that it is valid, note first that deg(v2) decreases in 
step S4, so the algorithm certainly terminates. At the conclusion of the algorithm, vı is 
a common right divisor of Vi and V2, since wiv1 = (—1)”Vı and —w2v1 = (—1)” V2; also 
if d is any common right divisor of Vi and V2, it is a right divisor of z1 Vı + z2 V2 = v1. 
Hence vı = gerd(Vi, V2). Also if m is any common left multiple of Vi and V2, we may 
assume without loss of generality that m = U1 Vı = U2V2, since the sequence of values 
of Q does not depend on U; and Uz. Hence m = (—1)” (—u221)Vı = (—1)” (u2z22)V2 is 
a multiple of z{Vi. 

In practice, if we just want to calculate gerd(Vi, V2), we may suppress the compu- 
tation of n, wi, w2, W1, W3, 21, 22, 21, 2. These additional quantities were added to 
the algorithm primarily to make its validity more readily established. 

Note: Nontrivial factorizations of string polynomials, such as the example given 
with this exercise, can be found from matrix identities such as 


    ( a 1 ) ( b 1 ) ( c 1 ) ( 0  1 ) ( 0  1 ) ( 0  1 )   ( 1 0 )
    ( 1 0 ) ( 1 0 ) ( 1 0 ) ( 1 −c ) ( 1 −b ) ( 1 −a ) = ( 0 1 ),

since these identities hold even when multiplication is not commutative. For example,

    (abc + a + c)(1 + ba) = (ab + 1)(cba + a + c).


(Compare this with the continuant polynomials of Section 4.5.3.) 
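
The example factorization can be checked by multiplying out both sides without assuming commutativity. In this sketch a string polynomial is represented as a dictionary from words to integer coefficients, with the empty string standing for the constant 1; that representation and the function name `smul` are illustrative choices, not notation from the text.

```python
from collections import defaultdict

def smul(p, q):
    """Multiply two string polynomials given as {word: coefficient} dictionaries."""
    out = defaultdict(int)
    for w1, c1 in p.items():
        for w2, c2 in q.items():
            out[w1 + w2] += c1 * c2      # concatenation, so the order of factors matters
    return {w: c for w, c in out.items() if c != 0}


lhs = smul({"abc": 1, "a": 1, "c": 1}, {"": 1, "ba": 1})    # (abc + a + c)(1 + ba)
rhs = smul({"ab": 1, "": 1}, {"cba": 1, "a": 1, "c": 1})    # (ab + 1)(cba + a + c)
print(sorted(lhs.items()))
print(lhs == rhs)
```

Both products expand to abcba + aba + abc + cba + a + c, while `smul({"a": 1}, {"b": 1})` and `smul({"b": 1}, {"a": 1})` differ, which shows that the representation really is noncommutative.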


19. [See Eugène Cahen, Théorie des Nombres 1 (Paris: 1914), 336-338.] If such an 
algorithm exists, D is a gcrd by the argument in exercise 18. Let us regard A and 
B as a single 2n x n matrix C whose first n rows are those of A, and whose second 
n rows are those of B. Similarly, P and Q can be combined into a 2n x n matrix 
R; X and Y can be combined into an n x 2n matrix Z. The desired conditions now 
reduce to two equations C = RD, D = ZC. If we can find a 2n × 2n integer matrix U
with determinant ±1 such that the last n rows of U^{−1}C are all zero, then R = (first n
columns of U), D = (first n rows of U^{−1}C), Z = (first n rows of U^{−1}) solves the desired
conditions. Hence, for example, the following algorithm may be used (with m = 2n):


Algorithm T (Triangularization). Let C be an m x n matrix of integers. This 
algorithm finds m x m integer matrices U and V such that UV = I and VC is upper 
triangular. (This means that the entry in row i and column j of VC is zero if i > j.) 


T1. [Initialize.] Set U ← V ← I, the m × m identity matrix; and set T ← C.
(Throughout the algorithm we will have T = VC and UV = I.)


T2. [Iterate on j.] Do step T3 for j = 1, 2, ..., min(m, n), then terminate the
algorithm.


T3. [Zero out column j.] Perform the following actions zero or more times until
T_{ij} is zero for all i > j: Let T_{kj} be a nonzero element of {T_{jj}, T_{(j+1)j}, ..., T_{mj}}
having the smallest absolute value. Interchange rows k and j of T and of V;
interchange columns k and j of U. Then subtract ⌊T_{ij}/T_{jj}⌋ times row j from
row i, in matrices T and V, and add the same multiple of column i to column j
in matrix U, for j < i ≤ m. ∎
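
Steps T1 through T3 translate almost directly into code. The following sketch uses 0-based indices and plain nested lists; the names `triangularize` and `matmul` and the sample matrix are assumptions made for illustration. Python's `//` operator is a floor division, which matches the floor brackets in step T3.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def triangularize(C):
    """Sketch of Algorithm T: return (U, V, T) with U*V = I and T = V*C upper triangular."""
    m, n = len(C), len(C[0])
    U = [[int(i == j) for j in range(m)] for i in range(m)]    # T1
    V = [[int(i == j) for j in range(m)] for i in range(m)]
    T = [row[:] for row in C]
    for j in range(min(m, n)):                                 # T2
        while any(T[i][j] != 0 for i in range(j + 1, m)):      # T3
            # Choose the nonzero entry of smallest absolute value in rows j..m-1.
            k = min((i for i in range(j, m) if T[i][j] != 0),
                    key=lambda i: abs(T[i][j]))
            T[j], T[k] = T[k], T[j]                            # swap rows of T and V
            V[j], V[k] = V[k], V[j]
            for row in U:                                      # swap columns of U
                row[j], row[k] = row[k], row[j]
            for i in range(j + 1, m):
                q = T[i][j] // T[j][j]                         # floor of the quotient
                if q:
                    T[i] = [a - q * b for a, b in zip(T[i], T[j])]
                    V[i] = [a - q * b for a, b in zip(V[i], V[j])]
                    for row in U:
                        row[j] += q * row[i]                   # column j += q * column i
    return U, V, T


C = [[4, 6], [10, 3], [6, 9], [2, 5]]     # arbitrary sample input
U, V, T = triangularize(C)
```

Each pass of the inner loop strictly reduces the smallest nonzero absolute value below the diagonal of column j, which is why the loop terminates; the invariants T = VC and UV = I can be confirmed with `matmul`.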
For the stated example, the algorithm yields (3 2) = (3 ly 2), (3 a) = (2 Ie 22); 


G =i) =i. Bie a) + (S AE D (Actually any matrix with determinant +1 would 


be a gerd in this particular case.) 


20. See V. Y. Pan, Information and Computation 167 (2001), 71-85. 


21. To get an upper bound, we may assume that Algorithm R is used only when
m − n ≤ 1; furthermore, the coefficients are bounded by (26) with m = n. [The stated
formula is, in fact, the execution time observed in practice, not merely an upper bound.
For more detailed information see G. E. Collins, Proc. 1968 Summer Inst. on Symbolic 
Mathematical Computation, edited by Robert G. Tobey (IBM Federal Systems Center: 
June 1969), 195—231.] 


22. A sequence of signs cannot contain two consecutive zeros, since u_{k+1}(x) is a
nonzero constant in (29). Moreover we cannot have “+, 0, +” or “−, 0, −” as
subsequences. The formula V(u, a) − V(u, b) is clearly valid when b = a, so we must
only verify it as b increases. The polynomials u_j(x) have finitely many roots, and
V(u, b) changes only when b encounters or passes such roots. Let x be a root of some
(possibly several) u_j. When b increases from x − ε to x, the sign sequence near j goes
from “+, ±, −” to “+, 0, −” or from “−, ±, +” to “−, 0, +” if j > 0; and from “+,
−” to “0, −” or from “−, +” to “0, +” if j = 0. (Since u′(x) is the derivative, u′(x) is
negative when u(x) is decreasing.) Thus the net change in V is −δ_{j0}. When b increases
from x to x + ε, a similar argument shows that V remains unchanged.

[L. E. Heindel, JACM 18 (1971), 533-548, has applied these ideas to construct
algorithms for isolating the real zeros of a given polynomial u(x), in time bounded by
a polynomial in deg(u) and log N, where all coefficients u_j are integers with |u_j| < N,
and all operations are guaranteed to be exact.]


23. Ifv has n—1 real roots occurring between the n real roots of u, then (by considering 
sign changes) u(x) mod v(x) has n — 2 real roots lying between the n — 1 roots of v. 


24. First show that hj = gt T ga takeli) Then show that 


the exponent of g2 on the left-hand side of (18) has the form 62 + 612, where x = 
bg +--+ + ija +1 — 62(63 + --- + 6;-1 + 1) — 63(1 — 62) (64 + --- + 6j-1 +1)
6j;-1(1 — 62)...(1 — 6;-2)(1). But z = 1, since it is seen to be independent of 5;—1 
and we can set 6;-1 = 0, etc. A similar derivation works for g3, g4, ..., and a simpler 


derivation works for (23). 


25. Each coefficient of u;(x) can be expressed as a determinant in which one column 
contains only (u), (v), and zeros. To use this fact, modify Algorithm C as follows: 
In step C1, set g + ged(¢(u), 2(v)) and h + 0. In step C3, if h = 0, set u(x) + v(x), 
v(x) + r(x)/g, h + L(u)?/g, g + Lu), and return to C2; otherwise proceed as in the 
unmodified algorithm. The effect of this new initialization is simply to replace w,;(x) 
by uj(x)/ged(€(u), &(v)) for all j > 3; thus, €?7~* will become £77—° in (28). 


26. In fact, even more is true. Note that the algorithm in exercise 3 computes +p, (x) 
and +q,(x) for n > —1. Let en = deg(qn) and dn = deg(pnu—dnv); we observed in exer- 
cise 3 that dn—-1+en = deg(u) for n > 0. We shall prove that the conditions deg(q) < en 
and deg(pu—qv) < dn—2 imply that p(x) = c(x)pn—1(x) and q(x) = c(x) qn-1(x): Given 
such p and q, we can find c(#) and d(x) such that p(x) = c(#)pn—1(x) + d(x)p,(x) and 
a(z) = e(2)qa-1(2) + d()qn(2), since pa-1(2)ga(2) — Pa()dn1(#) = +1. Hence 
pu— qu = C(pn—1U— dn-1v) + d(pnu— nv). If d(x) 4 0, we must have deg(c) + en-1 = 
deg(d) + en, since deg(q) < deg(q,); it follows that deg(c) + dn-ı > deg(d) + dn, since 
this is surely true if dn = —co and otherwise we have dn—1+é€n = dn+en+1 > dn+en-1. 
Therefore deg(pu — qv) = deg(c) + dn_-1. But we have assumed that deg(pu — qv) < 
dn—2 = dn—1 + en — en—1; 80 deg(c) < en — €n—1 and deg(d) < 0, a contradiction. 
[This result is essentially due to L. Kronecker, Monatsberichte Königl. preuß. Akad. 
Wiss. (Berlin: 1881), 535-600. It implies the following theorem: “Let u(x) and v(x) 
be relatively prime polynomials over a field and let d < deg(v) < deg(w). If g(x) 
is a polynomial of least degree such that there exist polynomials p(x) and r(x) with 
p(x) u(x) — q(x) v(x) = r(x) and deg(r) = d, then p(x)/q(x) = pn(x)/qn(x) for some n.” 
For if dn-2 > d > dn-1, there are solutions q(x) with deg(q) = en-1 + d —dn-1 < en, 
and we have proved that all solutions of such low degree have the stated property.| 


27. The ideas of answer 4.3.1—40 apply, but in simpler fashion because polynomial 
arithmetic is carry-free; right-to-left division uses 4.7-(3). Alternatively, with large 
values of n, we could divide Fourier transforms of the coefficients, using exercise 4.6.4— 
57 in reverse. 


SECTION 4.6.2 


1. For any choice of k ≤ n distinct roots, there are p^{n−k} monic polynomials having
those roots at least once. Therefore by the principle of inclusion and exclusion (Section
1.3.3), the number of polynomials without linear factors is Σ_{k≤n} \binom{p}{k} p^{n−k}(−1)^k, and it
is alternately ≤ and ≥ the partial sums of this series. The stated bounds correspond
to k ≤ 2 and k ≤ 3. When n ≥ p the probability of at least one linear factor is
1 − (1 − 1/p)^p. The average number of linear factors is p times the average number of
times x divides u(x), so it is 1 + p^{−1} + ··· + p^{1−n} = (p/(p − 1))(1 − p^{−n}).

[In a similar way, we find that there is an irreducible factor of degree 2 with 
probability Z rcn) (2-9/2) (1) Bp 2, this probability lies between 3 — ip? and 
4 — 4p ' when n > 2 and it approaches 1 — e-v4i4+ tp ')+O(p-”) as n —> oo. The 
average number of such factors is $ — 1p7?l™ 21] 

Note: Let u(x) be a fixed polynomial with integer coefficients. Peter Weinberger
has observed that, if u(x) is irreducible over the integers, the average number of linear
factors of u(x) modulo p approaches 1 as p → ∞, because the Galois group of u(x)
is transitive and the average number of 1-cycles in a randomly chosen element of any
transitive permutation group is 1. Thus, the average number of linear factors of u(x)
modulo p is the number of irreducible factors of u(x) over the integers, as p → ∞.
[See the remarks in the answer to exercise 37, and Proc. Symp. Pure Math. 24 (Amer.
Math. Soc., 1972), 321-332.]


2. (a) We know that u(x) has a representation as a product of irreducible polynomials;
and the leading coefficients of these polynomials must be units, since they divide the
leading coefficient of u(x). Therefore we may assume that u(x) has a representation as
a product of monic irreducible polynomials p_1(x)^{e_1} ... p_r(x)^{e_r}, where p_1(x), ..., p_r(x)
are distinct. This representation is unique, except for the order of the factors, so the
conditions on u(x), v(x), w(x) are satisfied if and only if

    v(x) = p_1(x)^{⌊e_1/2⌋} ... p_r(x)^{⌊e_r/2⌋},    w(x) = p_1(x)^{e_1 mod 2} ... p_r(x)^{e_r mod 2}.

(b) The generating function for the number of monic polynomials of degree n is
1 + pz + p^2 z^2 + ··· = 1/(1 − pz). The generating function for the number of polynomials of
degree n having the form v(x)^2, where v(x) is monic, is 1 + pz^2 + p^2 z^4 + ··· = 1/(1 − pz^2).
If the generating function for the number of monic squarefree polynomials of degree
n is g(z), then we must have 1/(1 − pz) = g(z)/(1 − pz^2) by part (a). Hence g(z) =
(1 − pz^2)/(1 − pz) = 1 + pz + (p^2 − p)z^2 + (p^3 − p^2)z^3 + ···. The answer is p^n − p^{n−1} for
n ≥ 2. [Curiously, this proves that u(x) ⊥ u′(x) with probability 1 − 1/p; it is the same
as the probability that u(x) ⊥ v(x) when u(x) and v(x) are independent, by exercise
4.6.1-5.]

Note: By a similar argument, every u(x) has a unique representation v(x)w(x)^r,
where v(x) is not divisible by the rth power of any irreducible; the number of such
monic polynomials v(x) is p^n − p^{n−r+1} for n ≥ r.
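
The count p^n − p^{n−1} from part (b) can be confirmed by exhaustion for small p and n. This sketch makes no attempt at efficiency; it lists every monic product v(x)^2 w(x) with deg(v) ≥ 1 and counts what is left. The function names and the coefficient-list representation (lowest degree first) are illustrative assumptions.

```python
from itertools import product

def monic_polys(deg, p):
    """All monic polynomials of the given degree mod p, lowest coefficient first."""
    for low in product(range(p), repeat=deg):
        yield list(low) + [1]

def pmul(a, b, p):
    c = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            c[i + j] = (c[i + j] + x * y) % p
    return c

def count_squarefree(n, p):
    """Count monic squarefree polynomials of degree n modulo p by brute force."""
    not_squarefree = set()
    for d in range(1, n // 2 + 1):              # d = degree of the repeated factor v
        for v in monic_polys(d, p):
            v2 = pmul(v, v, p)
            for w in monic_polys(n - 2 * d, p):
                not_squarefree.add(tuple(pmul(v2, w, p)))
    return p ** n - len(not_squarefree)


for n, p in [(2, 2), (3, 3), (4, 2), (2, 5)]:
    print(n, p, count_squarefree(n, p), p ** n - p ** (n - 1))
```

The two printed counts agree in each case, as the generating-function argument predicts for n ≥ 2.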


3. Let u(x) = u_1(x) ... u_r(x). There is at most one such v(x), by the argument of
Theorem 4.3.2C. There is at least one if, for each j, we can solve the system with
w_j(x) = 1 and w_k(x) = 0 for k ≠ j. A solution to the latter is v_1(x) Π_{k≠j} u_k(x), where
v_1(x) and v_2(x) can be found satisfying

    v_1(x) Π_{k≠j} u_k(x) + v_2(x)u_j(x) = 1,    deg(v_1) < deg(u_j),

by the extension of Euclid’s algorithm (exercise 4.6.1-3).
Over the integers we cannot make v(x) ≡ 1 (modulo x) and v(x) ≡ 0 (modulo x − 2)
when deg(v) < 2.


4. By unique factorization, we have (1 − pz)^{−1} = Π_{n≥1} (1 − z^n)^{−a_{np}}; after taking
logarithms, this can be rewritten

    ln(1/(1 − pz)) = Σ_{k,j≥1} a_{kp} z^{kj}/j = Σ_{j≥1} G_p(z^j)/j.

The stated identity now yields the answer G_p(z) = Σ_{m≥1} μ(m) m^{−1} ln(1/(1 − pz^m)),
from which we obtain a_{np} = Σ_{d\n} μ(n/d) p^d/n; thus lim_{p→∞} a_{np}/p^n = 1/n.
To prove the stated identity, note that

    Σ_{n,j≥1} μ(n) g(z^{nj}) n^{−t} j^{−t} = Σ_{m≥1} g(z^m) m^{−t} Σ_{n\m} μ(n) = g(z).

The numbers a_{np} were first found by Gauss; see his Werke 2, 219-222.
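
The closed form for a_{np}, the number of monic irreducible polynomials of degree n modulo p, is simple to evaluate. The sketch below assumes the Möbius-sum formula stated in answer 4; the function names `mobius` and `irreducible_count` are illustrative. It also checks the relation obtained by comparing degrees in the unique-factorization product, namely that the sum of d·a_{dp} over the divisors d of n equals p^n.

```python
def mobius(n):
    """The Moebius function, computed by trial division."""
    result, d = 1, 2
    while d * d <= n:
        if n % d == 0:
            n //= d
            if n % d == 0:
                return 0                 # n has a squared prime factor
            result = -result
        d += 1
    return -result if n > 1 else result

def irreducible_count(n, p):
    """Number of monic irreducible polynomials of degree n modulo the prime p."""
    total = sum(mobius(n // d) * p ** d for d in range(1, n + 1) if n % d == 0)
    return total // n


for n in range(1, 7):
    print(n, irreducible_count(n, 2))
```

For p = 2 the counts for degrees 1 through 4 are 2, 1, 2, 3, and the ratio a_{np}/p^n tends to 1/n as p grows.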
5. Let a_{npr} be the number of monic polynomials of degree n modulo p having exactly
r irreducible factors. Then Gp(z, w) = >>, „>o Gnpr2”w" = exp(D>, 51 Gp(z*)w*/k) = 
exp(} m>1 amu In(1/(1 — pz~™))); see Eq. 1.2.9-(38). We have 


ae, Anp2” = dG,(z/p, w)/dw Jw=1 = (i Gp(z*/p*)) Gp(z/p, 1) 
= (Sai (1/(1 — p"2")) o(n)/n)/(1 — 2), 
hence Anp = Hn+1/2p+O(p~*) for n > 2. The average value of 2” is [2] Gp(z/p, 2) = 
n+1+(n—1)/p+O(np~). (The variance is of order n?, however: Set w = 4.) 


6. For 0 ≤ s < p, x − s is a factor of x^p − x (modulo p) by Fermat’s theorem. So x^p − x
is a multiple of lcm(x − 0, x − 1, ..., x − (p − 1)) = x(x − 1) ... (x − p + 1). [Note:
Therefore the Stirling numbers [p over k] are multiples of p except when k = 1 or k = p.
Equation 1.2.6-(45) shows that the same statement is valid for Stirling numbers
{p over k} of the other kind.]

7. The factors on the right are relatively prime, and each is a divisor of u(x), so their 
product divides u(x). On the other hand, u(x) divides 


    v(x)^p − v(x) = Π_{0≤s<p} (v(x) − s),
so it divides the right-hand side by exercise 4.5.2—2. 
8. The vector (18) is the only output whose kth component is nonzero. 


9. For example, start with x ← 1 and y ← 1; then repeatedly set R[x] ← y,
x ← 2x mod 101, y ← 51y mod 101, one hundred times.
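
The table-building loop of answer 9 is shown here as a short sketch. It relies on two facts that can be checked directly: 2·51 = 102 ≡ 1 (modulo 101), so x·y stays congruent to 1, and 2 is a primitive root modulo 101, so x visits every nonzero residue during the hundred iterations. The array name `R` follows the answer; everything else is illustrative.

```python
R = [0] * 101            # R[x] will hold the inverse of x modulo 101
x, y = 1, 1
for _ in range(100):
    R[x] = y
    x = (2 * x) % 101    # x runs through the powers of 2
    y = (51 * y) % 101   # y runs through the powers of 51 = 2^(-1)

print(R[2], R[3], R[100])
```

No division or extended gcd computation is needed; each table entry costs two multiplications.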


10. The matrix Q − I below has a null space generated by the two vectors v^[1] =
(1,0,0,0,0,0,0,0), v^[2] = (0,1,1,0,0,1,1,1). The factorization is


p=2 p=5 

0 0 0 0 0 0 0 0 0000000 
0 1 1 0 0 0 0 0 0400010 
0 0 1 01 0 0 0 02204234 
0 0 0 1 0 0 1 0 014442421 
100 1 0 0 1 0 2223 4 3 2 
1 O0 1 1 1 0 0 0 004013 2 
1 1 O 1 1 1 0 1 


11. Removing the trivial factor x, the matrix Q − I above has a null space generated
by (1,0,0,0,0,0,0) and (0,3,1,4,1,2,1). The factorization is

    x(x^2 + 3x + 4)(x^5 + 2x^4 + x^3 + 4x^2 + x + 3).


12. If p = 2, (x + 1)^4 = x^4 + 1. If p = 8k + 1, Q − I is the zero matrix, so there are
four factors. For other values of p we have

              p = 8k+3            p = 8k+5            p = 8k+7
            ( 0  0  0  0 )      ( 0  0  0  0 )      ( 0  0  0  0 )
    Q − I = ( 0 −1  0  1 )      ( 0 −2  0  0 )      ( 0 −1  0 −1 )
            ( 0  0 −2  0 )      ( 0  0  0  0 )      ( 0  0 −2  0 )
            ( 0  1  0 −1 )      ( 0  0  0 −2 )      ( 0 −1  0 −1 )

Here Q − I has rank 2, so there are 4 − 2 = 2 factors. [But it is easy to prove that x^4 + 1
is irreducible over the integers, since it has no linear factors and the coefficient of x in
any factor of degree two must be less than or equal to 2 in absolute value by exercise 20.
(See also exercise 32, since x^4 + 1 = Ψ_8(x).) For all k ≥ 2, H. P. F. Swinnerton-Dyer
has exhibited polynomials of degree 2^k that are irreducible over the integers, but they
split completely into linear and quadratic factors modulo every prime. For degree 8, his
example is x^8 − 16x^6 + 88x^4 + 192x^2 + 144, having roots ±√2 ± √3 ± i [see Math. Comp.
24 (1970), 733-734]. According to the theorem of Frobenius cited in exercise 37, any
irreducible polynomial of degree n whose Galois group contains no n-cycles will have
factors modulo almost all primes.]
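
The factor counts for x^4 + 1 can be reproduced by forming Q − I and computing its rank modulo p. This is a sketch under stated assumptions: u is monic and squarefree modulo p, coefficients are stored lowest degree first, p is small enough that x^{pk} mod u(x) can be built by repeated multiplication by x, and the function name `berlekamp_factor_count` is illustrative.

```python
def berlekamp_factor_count(u, p):
    """Number of irreducible factors of the squarefree monic u modulo p: n - rank(Q - I)."""
    n = len(u) - 1                            # degree of u

    def times_x(a):
        # Multiply a (degree < n) by x and reduce modulo u, using x^n = -(u_0 + ... ).
        top = a[-1]
        shifted = [0] + a[:-1]
        return [(c - top * uc) % p for c, uc in zip(shifted, u[:n])]

    rows = []
    power = [1] + [0] * (n - 1)               # x^0
    for k in range(n):
        rows.append(power[:])                 # row k holds x^(p*k) mod u(x)
        for _ in range(p):
            power = times_x(power)

    M = [[(rows[k][j] - (k == j)) % p for j in range(n)] for k in range(n)]
    rank = 0
    for col in range(n):                      # Gaussian elimination modulo p
        pivot = next((r for r in range(rank, n) if M[r][col]), None)
        if pivot is None:
            continue
        M[rank], M[pivot] = M[pivot], M[rank]
        inv = pow(M[rank][col], p - 2, p)
        M[rank] = [(c * inv) % p for c in M[rank]]
        for r in range(n):
            if r != rank and M[r][col]:
                f = M[r][col]
                M[r] = [(a - f * b) % p for a, b in zip(M[r], M[rank])]
        rank += 1
    return n - rank


x4_plus_1 = [1, 0, 0, 0, 1]
for p in (3, 5, 7, 17):
    print(p, berlekamp_factor_count(x4_plus_1, p))
```

The primes 3, 5, and 7 represent the residue classes 8k+3, 8k+5, and 8k+7, each giving two factors, while p = 17 = 8·2 + 1 gives four.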


13. Case p = 8k+1: (x + (1 + √−1)/√2)(x + (1 − √−1)/√2)(x − (1 + √−1)/√2) ×
(x − (1 − √−1)/√2). Case p = 8k+3: (x^2 + √−2 x − 1)(x^2 − √−2 x − 1). Case
p = 8k+5: (x^2 + √−1)(x^2 − √−1). Case p = 8k+7: (x^2 + √2 x + 1)(x^2 − √2 x + 1).
The factorization for p = 8k + 7 also holds over the field of real numbers.

14. Algorithm N can be adapted to find the coefficients of w: Let A be the (r+1) × n
matrix whose kth row contains the coefficients of v(x)^k mod u(x), for 0 ≤ k ≤ r. Apply
the method of Algorithm N until the first dependence is found in step N3; then the
algorithm terminates with w(x) = v_0 + v_1 x + ··· + v_k x^k, where v_j is defined in (18).
At this point 2 ≤ k ≤ r; it is not necessary to know r in advance, since we can check
for dependency after generating each row of A.


15. We may assume that u ≠ 0 and that p is odd. Berlekamp’s method applied to
the polynomial x^2 − u tells us that a square root exists if and only if Q − I = O if
and only if u^{(p−1)/2} mod p = 1; but we already knew that. The method of Cantor
and Zassenhaus suggests that gcd(x^2 − u, (sx + t)^{(p−1)/2} − 1) will often be a nontrivial
factor; and indeed one can show that (p−1)/2 + (0, 1, or 2) values of s will succeed. In
practice, sequential choices seem to work just as well as random choices, so we obtain the
following algorithm: “Evaluate gcd(x^2 − u, x^{(p−1)/2} − 1), gcd(x^2 − u, (x+1)^{(p−1)/2} − 1),
gcd(x^2 − u, (x+2)^{(p−1)/2} − 1), ..., until finding the first case where the gcd has the
form x + v. Then √u = ±v.” The expected running time (with random s) will be
O(log p)^3 for large p.

A closer look shows that the first step of this algorithm succeeds if and only if
p mod 4 = 3. For if p = 2q + 1 where q is odd, we have x^q mod (x^2 − u) = u^{(q−1)/2} x,
and gcd(x^2 − u, x^q − 1) ≡ x − u^{(q+1)/2} since u^q ≡ 1 (modulo p). In fact, we see that the
formula √u = ±u^{(p+1)/4} mod p gives the square root directly whenever p mod 4 = 3.

But when p mod 4 = 1, we will have x^{(p−1)/2} mod (x^2 − u) = u^{(p−1)/4} and the gcd
will be 1. The algorithm above should therefore be used only when p mod 4 = 1, and
the first gcd should then be omitted.

A direct method that works nicely when p mod 8 = 5 was discovered in the 1990s
by A. O. L. Atkin, based on the fact that 2^{(p−1)/2} ≡ −1 in that case: Set v ←
(2u)^{(p−5)/8} mod p and i ← (2uv^2) mod p; then √u = ±(uv(i − 1)) mod p, and we also
have √−1 = ±i. [Computational Perspectives on Number Theory (Cambridge, Mass.:
International Press, 1998), 1-11; see also H. C. Pocklington, Proc. Camb. Phil. Soc.
19 (1917), 57-59.]

When p mod 8 = 1, a trial-and-error method seems to be necessary. The following
procedure due to Daniel Shanks often outperforms all other known algorithms in such
cases: Suppose p = 2^e q + 1 where e ≥ 3.


S1. Choose x at random in the range 1 < x < p, and set z = x^q mod p. If
z^{2^{e−1}} mod p = 1, repeat this step. (The average number of repetitions will
be less than 2. Random numbers will not be needed in steps S2 and S3. In
practice we can save time by trying small odd prime numbers x, and stopping
with z = x^q mod p when p^{(x−1)/2} mod x = x − 1; see exercise 1.2.4-47.)


S2. Set y ← z, r ← e, x ← u^{(q−1)/2} mod p, v ← ux mod p, w ← ux^2 mod p.


S3. If w = 1, stop; v is the answer. Otherwise find the smallest k such that
w^{2^k} mod p is equal to 1. If k = r, stop (there is no answer); otherwise set
(y, r, v, w) ← (y^{2^{r−k}}, k, vy^{2^{r−k−1}}, wy^{2^{r−k}}) and repeat step S3. ∎

The validity of this algorithm follows from the invariant congruences uw ≡ v^2,
y^{2^{r−1}} ≡ −1, w^{2^{r−1}} ≡ 1 (modulo p). When w ≠ 1, step S3 performs r + 2 multiplications
mod p; hence the maximum number of multiplications in that step is less than \binom{e+3}{2},
and the average number is less than (1/2)\binom{e+4}{2}. Thus the running time is O(log p)^3 for
steps S1 and S2 plus order e^2 (log p)^2 for step S3, compared to just O(log p)^3 for the
randomized method based on (21). But the constant factors in Shanks’s method are
small. [Congressus Numerantium 7 (1972), 58-62. A related but less efficient method
was published by A. Tonelli, Göttinger Nachrichten (1891), 344-346. The first person to
discover a square root algorithm with expected running time O(log p)^3 was M. Cipolla,
Rendiconti Accad. Sci. Fis. Mat. Napoli 9 (1903), 154-163.]
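
The three cases discussed in answer 15 fit into one short routine. This is a sketch, not a tuned implementation: the function name `sqrt_mod` is an assumption, and step S1 is replaced by a deterministic search for the smallest quadratic nonresidue instead of a random choice, which does not affect correctness. The variable names y, r, v, w follow steps S2 and S3.

```python
def sqrt_mod(u, p):
    """Return v with v*v congruent to u modulo the odd prime p, or None for a nonresidue."""
    u %= p
    if u == 0:
        return 0
    if pow(u, (p - 1) // 2, p) != 1:
        return None                               # Euler's criterion
    if p % 4 == 3:
        return pow(u, (p + 1) // 4, p)            # direct formula
    if p % 8 == 5:                                # Atkin's method
        v = pow(2 * u, (p - 5) // 8, p)
        i = (2 * u * v * v) % p                   # i*i is congruent to -1
        return (u * v * (i - 1)) % p
    # p mod 8 = 1: Shanks's method with p = 2^e * q + 1, q odd
    q, e = p - 1, 0
    while q % 2 == 0:
        q //= 2
        e += 1
    x = 2
    while pow(x, (p - 1) // 2, p) == 1:           # S1 (sequential, not random)
        x += 1
    y, r = pow(x, q, p), e                        # S2
    t = pow(u, (q - 1) // 2, p)
    v, w = (u * t) % p, (u * t * t) % p
    while w != 1:                                 # S3
        k, s = 0, w
        while s != 1:                             # smallest k with w^(2^k) = 1
            s = (s * s) % p
            k += 1
        for _ in range(r - k - 1):
            y = (y * y) % p
        v = (v * y) % p                           # v * y^(2^(r-k-1))
        y = (y * y) % p                           # y^(2^(r-k))
        w = (w * y) % p
        r = k
    return v


for p in (7, 13, 17, 41):
    print(p, [sqrt_mod(u, p) for u in range(1, 6)])
```

Because the routine tests Euler's criterion first, the case k = r of step S3 never arises inside the loop.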


16. (a) Substitute polynomials modulo p for integers, in the proof for n = 1. (b) The
proof for n = 1 carries over to any finite field. (c) Since x = ξ^k for some k, x^{p^n} = x
in the field defined by f(x). Furthermore, the elements y that satisfy the equation
y^{p^m} = y in the field are closed under addition, and closed under multiplication; so if
x^{p^m} = x, then ξ (being a polynomial in x with integer coefficients) satisfies ξ^{p^m} = ξ.


17. If ¿is a primitive root, each nonzero element is some power of €. Hence the order 
must be a divisor of 13? — 1 = 2° - 3 - 7, and y(f) elements have order f. 


f 9(f) f f) f Af) ff) 
1 1 3 2 7 6 21 12 
2 1 6 2 14 6 42 12 
4 2 12 4 28 č 12 84 24 
83 4 24 8 56 24 168 48 


18. (a) $\mathrm{pp}(p_1(u_n x)) \ldots \mathrm{pp}(p_r(u_n x))$, by Gauss's lemma. For example, let
$u(x) = 6x^3 - 3x^2 + 2x - 1$, $v(x) = x^3 - 3x^2 + 12x - 36 = (x^2 + 12)(x - 3)$;
then $\mathrm{pp}(36x^2 + 12) = 3x^2 + 1$, $\mathrm{pp}(6x - 3) = 2x - 1$. (This is a modern version of a
fourteenth-century trick used for many years to help solve algebraic equations.)

(b) Let $\mathrm{pp}(w(u_n x)) = \bar w_m x^m + \cdots + \bar w_0 = w(u_n x)/c$, where c is the content of
$w(u_n x)$ as a polynomial in x. Then $w(x) = (c\bar w_m/u_n^m)x^m + \cdots + c\bar w_0$, hence $c\bar w_m = u_n^m$;
since $\bar w_m$ is a divisor of $u_n$, c is a multiple of $u_n^{m-1}$.

19. If u(x) = v(x)w(x) with $\deg(v)\deg(w) \ge 1$, then $u_n x^n \equiv v(x)w(x)$ (modulo p).
By unique factorization modulo p, all but the leading coefficients of v and w are
multiples of p, and $p^2$ divides $v_0 w_0 = u_0$.

20. (a) $\sum_j (\alpha u_j - u_{j-1})(\bar\alpha \bar u_j - \bar u_{j-1}) = \sum_j (u_j - \bar\alpha u_{j-1})(\bar u_j - \alpha \bar u_{j-1})$. (b) We may
assume that $u_0 \ne 0$. Let $m(u) = \prod_{j=1}^{n} \min(1, |\alpha_j|) = |u_0|/M(u)$. Whenever $|\alpha_j| < 1$,
change the factor $x - \alpha_j$ to $\bar\alpha_j x - 1$ in u(x); this doesn't affect $\|u\|$, but it changes
$|u_0|$ to M(u). (c) $u_j = \pm u_n \sum \alpha_{i_1} \ldots \alpha_{i_{n-j}}$, an elementary symmetric function, hence
$|u_j| \le |u_n| \sum \beta_{i_1} \ldots \beta_{i_{n-j}}$ where $\beta_i = \max(1, |\alpha_i|)$. We complete the proof by showing
that when $x_1 \ge 1$, ..., $x_n \ge 1$, and $x_1 \ldots x_n = M$, the elementary symmetric function
$\sigma_{nk} = \sum x_{i_1} \ldots x_{i_k}$ is $\le \binom{n-1}{k-1} M + \binom{n-1}{k}$, the value assumed when $x_1 = \cdots = x_{n-1} = 1$
and $x_n = M$. (For if $x_1 \le \cdots \le x_n < M$, the transformation $x_n \leftarrow x_{n-1}x_n$,
$x_{n-1} \leftarrow 1$ increases $\sigma_{nk}$ by $\sigma_{(n-2)(k-1)}(x_n - 1)(x_{n-1} - 1)$, which is positive.) (d) $|v_j| \le
\binom{m-1}{j} M(v) + \binom{m-1}{j-1}|v_m| \le \binom{m-1}{j} M(u) + \binom{m-1}{j-1}|u_n|$ since $M(v) \le M(u)$ and $|v_m| \le |u_n|$.
[M. Mignotte, Math. Comp. 28 (1974), 1153-1157.]

Notes: This solution shows that $\binom{m-1}{j} M(u) + \binom{m-1}{j-1}|u_n|$ is an upper bound, so we
would like to have a better estimate of M(u). Several methods are known [W. Specht,
Math. Zeit. 53 (1950), 357-363; Cerlienco, Mignotte, and Piras, J. Symbolic Comp.
4 (1987), 21-33]. The simplest and most rapidly convergent is perhaps the following
procedure [see C. H. Graeffe, Auflösung der höheren numerischen Gleichungen (Zürich:
1837)]: Assuming that $u(x) = u_n(x - \alpha_1)\ldots(x - \alpha_n)$, let $\hat u(x) = u(\sqrt{x})\,u(-\sqrt{x}) =
(-1)^n u_n^2 (x - \alpha_1^2)\ldots(x - \alpha_n^2)$. Then $M(u)^2 = M(\hat u) \le \|\hat u\|$. Hence we may set $c \leftarrow \|u\|$,
$v \leftarrow u/c$, $t \leftarrow 0$, and then repeatedly set $t \leftarrow t + 1$, $c \leftarrow \|\hat v\|^{1/2^t} c$, $v \leftarrow \hat v/\|\hat v\|$.
The invariant relations $M(u) = c\,M(v)^{1/2^t}$ and $\|v\| = 1$ guarantee that $M(u) \le c$
at each step of the iteration. Notice that when $v(x) = v_0(x^2) + x\,v_1(x^2)$, we have
$\hat v(x) = v_0(x)^2 - x\,v_1(x)^2$. It can be shown that if each $|\alpha_j|$ is $\le \rho$ or $\ge 1/\rho$, then
$M(u) = \|u\|(1 + O(\rho))$; hence c will be $M(u)(1 + O(\rho^{2^t}))$ after t steps.

For example, if u(x) is the polynomial of (22), the successive values of c for t = 0,
1, 2, ... turn out to be 10.63, 12.42, 6.85, 6.64, 6.65, 6.6228, 6.62246, 6.62246, ....
In this example $\rho \approx .90982$. Notice that convergence is not monotonic. Eventually
v(x) will converge to the monomial $x^m$, where m is the number of roots with $|\alpha_j| < 1$,
assuming that $|\alpha_j| \ne 1$ for all j; in general, if there are k roots with $|\alpha_j| = 1$, the
coefficients of $x^m$ and $x^{m+k}$ will not approach zero, while the coefficients of higher and
lower powers of x will.
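The iteration $c \leftarrow \|\hat v\|^{1/2^t} c$, $v \leftarrow \hat v/\|\hat v\|$ described above can be sketched in Python with floating-point coefficients. The helper names (`norm`, `graeffe`, `mahler_upper_bounds`) are hypothetical, polynomials are assumed to be stored as coefficient lists with the constant term first, and the test polynomial $2x^2 - 7x + 3 = (x - 3)(2x - 1)$, for which M(u) = 6, is chosen here only for illustration.

```python
import math

def norm(poly):
    """Euclidean norm ||u|| of the coefficient vector."""
    return math.sqrt(sum(c * c for c in poly))

def poly_mul(a, b):
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def graeffe(v):
    """Return v_hat(x) = v0(x)^2 - x*v1(x)^2, where v(x) = v0(x^2) + x*v1(x^2)."""
    even_sq = poly_mul(v[0::2], v[0::2])
    odd_sq = [0.0] + poly_mul(v[1::2], v[1::2])   # the leading 0.0 multiplies by x
    size = max(len(even_sq), len(odd_sq))
    even_sq += [0.0] * (size - len(even_sq))
    odd_sq += [0.0] * (size - len(odd_sq))
    return [e - o for e, o in zip(even_sq, odd_sq)]

def mahler_upper_bounds(u, steps):
    """Return the successive values of c for t = 0, 1, ..., steps."""
    c = norm(u)
    v = [a / c for a in u]
    bounds = [c]
    for t in range(1, steps + 1):
        v_hat = graeffe(v)
        size = norm(v_hat)
        c *= size ** (1.0 / 2 ** t)
        v = [a / size for a in v_hat]
        bounds.append(c)
    return bounds
```

Calling `mahler_upper_bounds([3, -7, 2], 8)` gives a sequence that starts at $\sqrt{62}$ and settles near 6; each entry is an upper bound on M(u), as the invariant relations require.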

A famous formula due to Jensen [Acta Math. 22 (1899), 359-364] proves that M(u)
is the geometric mean of |u(x)| on the unit circle, namely $\exp\bigl(\frac{1}{2\pi}\int_0^{2\pi} \ln|f(e^{i\theta})|\,d\theta\bigr)$.
Exercise 21(a) will show, similarly, that $\|u\|$ is the root-mean-square of |u(x)| on the
unit circle. The inequality $M(u) \le \|u\|$, which goes back to E. Landau [Bull. Soc. Math.
de France 33 (1905), 251-261], can therefore be understood as a relation between mean
values. The number M(u) is often called the Mahler measure of a polynomial, because
Kurt Mahler used it in Mathematika 7 (1960), 98-100. Incidentally, Jensen also proved
that $\frac{1}{2\pi}\int_0^{2\pi} e^{im\theta} \ln|f(e^{i\theta})|\,d\theta = -\sum_{j=1}^{n} \alpha_j^m/(2m \max(|\alpha_j|, 1)^{2m})$ when m > 0.

21. (a) The coefficient of $a_p b_q c_r d_s$ is zero on both sides unless p + s = q + r. And
when this condition holds, the coefficient on the right is (p + s)!; on the left it is


$$\sum_j \binom{p}{j}\binom{s}{r-j}\, q!\, r! = \binom{p+s}{r}\, q!\, r! = (q+r)!\,.$$


[B. Beauzamy and J. Dégot, Trans. Amer. Math. Soc. 345 (1995), 2607-2619;
D. Zeilberger, AMM 101 (1994), 894-896.]

(b) Let $a_p = v_p$, $b_q = w_q$, $c_r = \bar v_r$, $d_s = \bar w_s$. Then the right side of (a) is B(u),
and the left side is a sum of nonnegative terms for each j and k. If we consider only
the terms where $\sum j$ is the degree of v, the terms $v_p/(p - j)!$ vanish except when p = j.
Those terms therefore reduce to

$$\sum_{j,k} \frac{1}{j!\,k!}\, \bigl|v_j w_k\, j!\, k!\bigr|^2 = B(v)B(w).$$

[B. Beauzamy, E. Bombieri, P. Enflo, and H. Montgomery, J. Number Theory 36
(1990), 219-245.]
(c) Adding a new variable, if needed to make everything homogeneous, does not
change the relation u = vw. Thus if v and w have total degrees m and n, respectively,
we have $(m + n)!\,[u]^2 \ge m!\,[v]^2\, n!\,[w]^2$; in other words, $[v][w] \le \binom{m+n}{m}^{1/2} [u]$.

Incidentally, one nice way to think of the Bombieri norm is to imagine that the
variables are noncommutative. For example, instead of $3xy^3 - z^2w^2$ we could write
$\frac34 xyyy + \frac34 yxyy + \frac34 yyxy + \frac34 yyyx - \frac16 zzww - \frac16 zwzw - \frac16 zwwz - \frac16 wzzw - \frac16 wzwz - \frac16 wwzz$.
Then the Bombieri norm is the $\|\ \|$ norm on the new coefficients. Another interesting
formula, when u is homogeneous of degree n, is


(d) The one-variable case corresponds to t = 2. Suppose u = vw where v is
homogeneous of degree m in t variables. Then $|v_k|^2\, k!/m! \le [v]^2$ for all k, and $k! \ge
(m/t)!^t$ since $\log \Gamma(x)$ is convex for x > 0; therefore $|v_k|^2 \le m!\,[v]^2/(m/t)!^t$. We can
assume that $m!\,[v]^2/(m/t)!^t \le m'!\,[w]^2/(m'/t)!^t$, where $m' = n - m$ is the degree of w.
Then

$$|v_k|^2 \le \frac{m!\,[v]^2}{(m/t)!^t} \le \frac{m!^{1/2}\, m'!^{1/2}\, [v][w]}{(m/t)!^{t/2}\,(m'/t)!^{t/2}} \le \frac{n!^{1/2}\,[u]}{(n/2t)!^t}\,.$$

(A better bound is obtained if we maximize the next-to-last expression over all
degrees m for which a factor has not been ruled out.) The quantity $n!^{1/4}/(n/2t)!^{t/2}$ is
$c_t\,(2t)^{n/4}\, n^{-(2t-1)/8}\,(1 + O(\frac1n))$, where $c_t = 2^{1/8}\, \pi^{-(2t-1)/8}\, t^{t/4}$ is $\approx 1.004$ when t = 2.

Notice that we have not demonstrated the existence of an irreducible factor with 
such small coefficients; further splitting may be needed. See exercise 41. 

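(e) $[u]^2 = \sum_k \binom{n}{k}^2 / \binom{2n}{2k} = \sum_k \binom{2k}{k}\binom{2n-2k}{n-k} / \binom{2n}{n} = 4^n / \binom{2n}{n} = \sqrt{\pi n} + O(n^{-1/2})$. If
$v(x) = (x - 1)^n$ and $w(x) = (x + 1)^n$, we have $[v]^2 = [w]^2 = 2^n$; hence the inequality
of (c) is an equality in this case.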
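The equality case in (e) is easy to check numerically. The sketch below assumes the one-variable form of the Bombieri norm, $[u]^2 = \sum_j |u_j|^2/\binom{n}{j}$ for a polynomial of degree n (the form used in part (g)); the function names are hypothetical, and exact rational arithmetic is used so that the comparisons are exact.

```python
from fractions import Fraction
from math import comb

def bombieri_sq(coeffs):
    """[u]^2 for a one-variable polynomial, coefficients listed constant term first."""
    n = len(coeffs) - 1
    return sum(Fraction(c * c, comb(n, j)) for j, c in enumerate(coeffs))

def binomial_power(sign, n):
    """Coefficients of (x + sign)^n, constant term first."""
    return [comb(n, j) * sign ** (n - j) for j in range(n + 1)]

def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

n = 6
v = binomial_power(-1, n)      # (x - 1)^n
w = binomial_power(+1, n)      # (x + 1)^n
u = poly_mul(v, w)             # (x^2 - 1)^n
lhs = bombieri_sq(v) * bombieri_sq(w)      # ([v][w])^2
rhs = comb(2 * n, n) * bombieri_sq(u)      # binom(2n, n) * [u]^2
```

Here `lhs` and `rhs` agree, which is the squared form of $[v][w] = \binom{2n}{n}^{1/2}[u]$.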

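(f) Let u and v be homogeneous of degree m and n. Then

$$[uv]^2 \le \sum_k \frac{\bigl(\sum_j |u_j v_{k-j}|\bigr)^2}{\binom{m+n}{k}} \le \sum_k \Bigl(\sum_j \frac{|u_j|^2\,|v_{k-j}|^2}{\binom{m}{j}\binom{n}{k-j}}\Bigr) \Bigl(\sum_j \binom{m}{j}\binom{n}{k-j}\Bigr) \Big/ \binom{m+n}{k} = [u]^2\,[v]^2$$

by Cauchy's inequality. [B. Beauzamy, J. Symbolic Comp. 13 (1992), 465-472,
Proposition 5.]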

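(g) By exercise 20, $\binom{n}{\lfloor n/2\rfloor}^{-1} M(u)^2 \le \binom{n}{\lfloor n/2\rfloor}^{-1} \|u\|^2 = \binom{n}{\lfloor n/2\rfloor}^{-1} \sum_j |u_j|^2 \le
[u]^2 = \sum_j \binom{n}{j}^{-1} |u_j|^2 \le \sum_j \binom{n}{j} M(u)^2 = 2^n M(u)^2$. The upper inequality also follows
from (f), for if $u(x) = u_n \prod_{j=1}^{n} (x - \alpha_j)$ we have $[u]^2 \le |u_n|^2 \prod_{j=1}^{n} [x - \alpha_j]^2 =
|u_n|^2 \prod_{j=1}^{n} (1 + |\alpha_j|^2) \le |u_n|^2 \prod_{j=1}^{n} (2 \max(1, |\alpha_j|)^2) = 2^n M(u)^2$.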
22. More generally, assume that $u(x) \equiv v(x)w(x)$ (modulo q), $a(x)v(x) + b(x)w(x) \equiv 1$
(modulo p), $c \cdot \ell(v) \equiv 1$ (modulo r), deg(a) < deg(w), deg(b) < deg(v), and deg(u) =
deg(v) + deg(w), where r = gcd(p, q) and p, q needn't be prime. We shall construct
polynomials $V(x) \equiv v(x)$ and $W(x) \equiv w(x)$ (modulo q) such that $u(x) \equiv V(x)W(x)$
(modulo qr), $\ell(V) = \ell(v)$, deg(V) = deg(v), deg(W) = deg(w); furthermore, if r is
prime, the results will be unique modulo qr.

The problem asks us to find $\bar v(x)$ and $\bar w(x)$ with $V(x) = v(x) + q\bar v(x)$, $W(x) =
w(x) + q\bar w(x)$, $\deg(\bar v) < \deg(v)$, $\deg(\bar w) \le \deg(w)$; and the other condition

$$(v(x) + q\bar v(x))(w(x) + q\bar w(x)) \equiv u(x) \pmod{qr}$$

is equivalent to $\bar w(x)v(x) + \bar v(x)w(x) \equiv f(x)$ (modulo r), where f(x) satisfies $u(x) \equiv
v(x)w(x) + qf(x)$ (modulo qr). We have

$$(a(x)f(x) + t(x)w(x))\,v(x) + (b(x)f(x) - t(x)v(x))\,w(x) \equiv f(x) \pmod{r}$$

for all t(x). Since $\ell(v)$ has an inverse modulo r, we can find a quotient t(x) by
Algorithm 4.6.1D such that deg(bf − tv) < deg(v); for this t(x), $\deg(af + tw) \le \deg(w)$,
since we have $\deg(f) \le \deg(u) = \deg(v) + \deg(w)$. Thus the desired solution is
$\bar v(x) = b(x)f(x) - t(x)v(x) = b(x)f(x) \bmod v(x)$, $\bar w(x) = a(x)f(x) + t(x)w(x)$. If
$(\bar{\bar v}(x), \bar{\bar w}(x))$ is another solution, we have $(\bar w(x) - \bar{\bar w}(x))\,v(x) \equiv (\bar{\bar v}(x) - \bar v(x))\,w(x)$
(modulo r). Thus if r is prime, v(x) must divide $\bar{\bar v}(x) - \bar v(x)$; but $\deg(\bar{\bar v} - \bar v) < \deg(v)$,
so $\bar{\bar v}(x) = \bar v(x)$ and $\bar{\bar w}(x) = \bar w(x)$.

If p divides q, so that r = p, our choices of V(x) and W(x) also satisfy $a(x)V(x) +
b(x)W(x) \equiv 1$ (modulo p), as required by Hensel's Lemma.

For p = 2, the factorization proceeds as follows (writing only the coefficients,
and using bars for negative digits): Exercise 10 says that $v_1(x) = (111)$, $w_1(x) =
(1110011)$ in one-bit two's complement notation. Euclid's extended algorithm yields
a(x) = (100001), b(x) = (10). The factor $v(x) = x^2 + c_1 x + c_0$ must have $|c_1| \le
\lfloor 1 + \sqrt{113} \rfloor = 11$, $|c_0| \le 10$, by exercise 20. Three applications of Hensel's lemma
yield $v_4(x) = (1\,3\,\bar1)$, $w_4(x) = (1\,\bar3\,\bar5\,\bar4\,4\,\bar3\,5)$. Thus $c_1 \equiv 3$ and $c_0 \equiv -1$ (modulo 16);
the only possible quadratic factor of u(x) is $x^2 + 3x - 1$. Division fails, so u(x) is
irreducible. (Since we have now proved the irreducibility of this beloved polynomial by
four separate methods, it is unlikely that it has any factors.)
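The lifting step of this answer (with r = p, so that a(x) and b(x) can be reused unchanged) can be sketched in Python. The names `hensel_step`, `poly_mul`, and so on are hypothetical; polynomials are coefficient lists with the constant term first, and v(x) is assumed monic. The sample data take u(x) to be $x^8 + x^6 - 3x^4 - 3x^3 + 8x^2 + 2x - 5$, the polynomial whose norm is $\sqrt{113}$, together with the factors modulo 2 quoted above (with signs ignored, which is harmless modulo 2).

```python
def poly_trim(a):
    while len(a) > 1 and a[-1] == 0:
        a = a[:-1]
    return a

def poly_add(a, b, m):
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return [(x + y) % m for x, y in zip(a, b)]

def poly_mul(a, b, m):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % m
    return out

def poly_divmod_monic(a, v, m):
    """Quotient and remainder of a(x) by monic v(x), coefficients modulo m."""
    a = [x % m for x in a]
    quot = [0] * max(len(a) - len(v) + 1, 1)
    for i in range(len(a) - len(v), -1, -1):
        c = a[i + len(v) - 1]
        quot[i] = c
        for j, y in enumerate(v):
            a[i + j] = (a[i + j] - c * y) % m
    return quot, a[:len(v) - 1]

def hensel_step(u, v, w, a, b, p, q):
    """Lift u = v*w (mod q) to u = V*W (mod q*p), given a*v + b*w = 1 (mod p)."""
    diff = poly_add(u, [-c for c in poly_mul(v, w, q * p)], q * p)
    f = [(c // q) % p for c in diff]            # u = v*w + q*f (mod q*p)
    t, v_bar = poly_divmod_monic(poly_mul(b, f, p), [c % p for c in v], p)
    w_bar = poly_add(poly_mul(a, f, p),
                     poly_mul(t, [c % p for c in w], p), p)
    V = poly_add(v, [q * c for c in v_bar], q * p)
    W = poly_add(w, [q * c for c in w_bar], q * p)
    return poly_trim(V), poly_trim(W)

u = [-5, 2, 8, -3, -3, 0, 1, 0, 1]
v, w = [1, 1, 1], [1, 1, 0, 0, 1, 1, 1]       # factors modulo 2
a, b = [1, 0, 0, 0, 0, 1], [0, 1]             # a*v + b*w = 1 (mod 2)
q = 2
for _ in range(3):                            # lift to modulus 4, 8, 16
    v, w = hensel_step(u, v, w, a, b, 2, q)
    q *= 2
```

After the three steps, `v` holds the residues [15, 3, 1] modulo 16, that is, $x^2 + 3x - 1$, in agreement with $c_1 \equiv 3$ and $c_0 \equiv -1$.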

Hans Zassenhaus has observed that we can often speed up such calculations by
increasing p as well as q: When r = p in the notation above, we can find A(x), B(x) such
that $A(x)V(x) + B(x)W(x) \equiv 1$ (modulo $p^2$), namely by taking $A(x) = a(x) + p\bar a(x)$,
$B(x) = b(x) + p\bar b(x)$, where $\bar a(x)V(x) + \bar b(x)W(x) \equiv g(x)$ (modulo p), $a(x)V(x) +
b(x)W(x) \equiv 1 - p\,g(x)$ (modulo $p^2$). We can also find C with $\ell(V)\,C \equiv 1$ (modulo $p^2$). In
this way we can lift a squarefree factorization $u(x) \equiv v(x)w(x)$ (modulo p) to its unique
extensions modulo $p^2$, $p^4$, $p^8$, $p^{16}$, etc. However, this "accelerated" procedure reaches a
point of diminishing returns in practice, as soon as we get to double-precision moduli,
since the time for multiplying multiprecision numbers in practical ranges outweighs the
advantage of squaring the modulus directly. From a computational standpoint it seems
best to work with the successive moduli $p$, $p^2$, $p^4$, $p^8$, ..., $p^E$, $p^{E+e}$, $p^{E+2e}$, $p^{E+3e}$, ...,
where E is the smallest power of 2 with $p^E$ greater than single precision and e is the
largest integer such that $p^e$ has single precision.


"Hensel's Lemma" was actually invented by C. F. Gauss about 1799, in the draft
of an unfinished book called Analysis Residuorum, §373-374. Gauss incorporated
most of the material from that manuscript into his Disquisitiones Arithmeticæ (1801),
but his ideas about polynomial factorization were not published until after his death
[see his Werke 2 (Göttingen, 1863), 238]. Meanwhile T. Schönemann had independently
discovered the lemma and proved uniqueness [Crelle 32 (1846), 93-105, §59].
Hensel's name was attached to the method because it is basic to the theory of
p-adic numbers (see exercise 4.1-31). The lemma can be generalized in several ways.
First, if there are more factors, say $u(x) \equiv v_1(x)v_2(x)v_3(x)$ (modulo p), we can find
$a_1(x)$, $a_2(x)$, $a_3(x)$ such that $a_1(x)v_2(x)v_3(x) + a_2(x)v_1(x)v_3(x) + a_3(x)v_1(x)v_2(x) \equiv
1$ (modulo p) and $\deg(a_i) < \deg(v_i)$. (In essence, 1/u(x) is expanded in partial
fractions as $\sum a_i(x)/v_i(x)$.) An exactly analogous construction now allows us to
lift the factorization without changing the leading coefficients of $v_1$ and $v_2$; we take
$\bar v_1(x) = a_1(x)f(x) \bmod v_1(x)$, $\bar v_2(x) = a_2(x)f(x) \bmod v_2(x)$, etc. Another important
generalization is to several simultaneous moduli, of the respective forms $p^e$, $(x_2 - a_2)^{n_2}$,
..., $(x_t - a_t)^{n_t}$, when performing multivariate gcds and factorizations. See D. Y. Y.
Yun, Ph.D. Thesis (M.I.T., 1974).


23. The discriminant of pp(u(x)) is a nonzero integer (see exercise 4.6.1-12), and
there are multiple factors modulo p if and only if p divides the discriminant. [The
factorization of (22) modulo 3 is $(x + 1)(x^2 - x - 1)^2(x^3 + x^2 - x + 1)$; squared factors
for this polynomial occur only for p = 3, 23, 233, and 121702457. It is not difficult to
prove that the smallest prime that is not unlucky is at most O(n log Nn), if n = deg(u)
and if N bounds the coefficients of u(x).]


24. Multiply a monic polynomial with rational coefficients by a suitable nonzero
integer, to get a primitive polynomial over the integers. Factor this polynomial over the
integers, and then convert the factors back to monic. (No factorizations are lost in this 
way; see exercise 4.6.1-8.) 


25. Consideration of the constant term shows there are no factors of degree 1, so if 
the polynomial is reducible, it must have one factor of degree 2 and one of degree 3. 
Modulo 2 the factors are $x(x + 1)^2(x^2 + x + 1)$; this is not much help. Modulo 3 the
factors are $(x + 2)^2(x^3 + 2x + 2)$. Modulo 5 they are $(x^2 + x + 1)(x^3 + 4x + 2)$. So we
see that the answer is $(x^2 + x + 1)(x^3 - x + 2)$.

26. Begin with $D \leftarrow (0 \ldots 01)$, representing the set {0}. Then for $1 \le j \le r$, set
$D \leftarrow D \mid (D \ll d_j)$, where | denotes bitwise "or" and $D \ll d$ denotes D shifted left
d bit positions. (Actually we need only work with a bit vector of length $\lceil (n + 1)/2 \rceil$,
since n − m is in the set if and only if m is.)
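The bit-vector update $D \leftarrow D \mid (D \ll d_j)$ maps directly onto Python's arbitrary-precision integers. The sketch below uses a hypothetical function name and, for simplicity, keeps the full vector rather than the half-length one mentioned in the parenthetical remark.

```python
def possible_factor_degrees(degrees):
    """Return all sums of subsets of the given degrees d_1, ..., d_r."""
    D = 1                      # bit 0 set: represents the set {0}
    for d in degrees:
        D |= D << d            # every old sum m also yields m + d
    return [m for m in range(D.bit_length()) if (D >> m) & 1]
```

For instance, degrees 2 and 6 give the set {0, 2, 6, 8}, so a factor of degree 3 would be ruled out by that factorization pattern.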


27. Exercise 4 says that a random polynomial of degree n is irreducible modulo p with 
rather low probability, about 1/n. But the Chinese remainder theorem implies that a 
random monic polynomial of degree n over the integers will be reducible with respect 
to each of k distinct primes with probability about $(1 - 1/n)^k$, and this approaches zero
as $k \to \infty$. Hence almost all polynomials over the integers are irreducible with respect
to infinitely many primes; and almost all primitive polynomials over the integers are 
irreducible. [Another proof has been given by W. S. Brown, AMM 70 (1963), 965-969.] 


28. See exercise 4; the probability is $[z^n]\,(1 + a_{1p}z/p)(1 + a_{2p}z^2/p^2)(1 + a_{3p}z^3/p^3)\ldots$,
which has the limiting value $g(z) = (1 + z)(1 + \frac12 z^2)(1 + \frac13 z^3)\ldots$. For $1 \le n \le 10$ the
answers are 1, 1/2, 5/6, 7/12, 37/60, 79/120, 173/280, 101/168, 127/210, 1033/1680.
[Let $f(y) = \ln(1 + y) - y = O(y^2)$. We have

$$g(z) = \exp\Bigl(\sum_{n\ge1} z^n/n + \sum_{n\ge1} f(z^n/n)\Bigr) = h(z)/(1 - z),$$

and it can be shown that the limiting probability is $h(1) = \exp\bigl(\sum_{n\ge1} f(1/n)\bigr) =
e^{-\gamma} \approx .56146$ as $n \to \infty$. Indeed, N. G. de Bruijn has established the asymptotic
formula $\lim_{p\to\infty} a_{np} = e^{-\gamma} + e^{-\gamma}/n + O(n^{-2}\log n)$. [See D. H. Lehmer, Acta Arith. 21
(1972), 379-388; D. H. Greene and D. E. Knuth, Math. for the Analysis of Algorithms
(Boston: Birkhäuser, 1981), §4.1.6.] On the other hand the answers for $1 \le n \le 10$
when p = 2 are smaller: 1, 1/4, 1/2, 7/16, 7/16, 7/16, 27/64, 111/256, 109/256, 109/256. A. Knopfmacher and
R. Warlimont [Trans. Amer. Math. Soc. 347 (1995), 2235-2243] have shown that for
fixed p the probability is $c_p + O(1/n)$, where $c_p = \prod_{m\ge1} e^{-1/m}(1 + a_{mp}/p^m)$, $c_2 \approx .397$.]
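The two lists of fractions can be recomputed from the generating function $\prod_{k\ge1}(1 + a_{kp}z^k/p^k)$. In the sketch below the function names are hypothetical, and $a_{kp}$ is assumed to be the number of monic irreducible polynomials of degree k modulo p, computed by the usual Möbius-function formula; exact rational arithmetic avoids rounding.

```python
from fractions import Fraction

def mobius(n):
    """Möbius function by trial division."""
    result, d = 1, 2
    while d * d <= n:
        if n % d == 0:
            n //= d
            if n % d == 0:
                return 0       # squared prime factor
            result = -result
        d += 1
    return -result if n > 1 else result

def irreducible_count(k, p):
    """Number of monic irreducible polynomials of degree k modulo p."""
    total = sum(mobius(k // d) * p ** d for d in range(1, k + 1) if k % d == 0)
    return total // k

def distinct_degree_probabilities(nmax, weight):
    """Coefficients of z^1..z^nmax in prod_{k>=1} (1 + weight(k) z^k)."""
    coef = [Fraction(0)] * (nmax + 1)
    coef[0] = Fraction(1)
    for k in range(1, nmax + 1):
        wk = weight(k)
        for n in range(nmax, k - 1, -1):   # descending: each factor used once
            coef[n] += wk * coef[n - k]
    return coef[1:]

limiting = distinct_degree_probabilities(10, lambda k: Fraction(1, k))
for_p2 = distinct_degree_probabilities(
    10, lambda k: Fraction(irreducible_count(k, 2), 2 ** k))
```

The list `limiting` corresponds to g(z), and `for_p2` to the case p = 2.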


29. Let $q_1(x)$ and $q_2(x)$ be any two of the irreducible divisors of g(x). By the Chinese
remainder theorem (exercise 3), choosing a random polynomial t(x) of degree < 2d is
equivalent to choosing two random polynomials $t_1(x)$ and $t_2(x)$ of degrees < d, where
$t_i(x) = t(x) \bmod q_i(x)$. The gcd will be a proper factor if $t_1(x)^{(p^d-1)/2} \bmod q_1(x) = 1$
and $t_2(x)^{(p^d-1)/2} \bmod q_2(x) \ne 1$, or vice versa, and this condition holds for exactly
$2((p^d - 1)/2)((p^d + 1)/2) = (p^{2d} - 1)/2$ choices of $t_1(x)$ and $t_2(x)$.

Notes: We are considering here only the behavior with respect to two irreducible
factors, but the true behavior is probably much better. Suppose that each irreducible
factor $q_i(x)$ has probability $\frac12$ of dividing $t(x)^{(p^d-1)/2} - 1$ for each t(x), independent of
the behavior for other $q_j(x)$ and t(x); and assume that g(x) has r irreducible factors in
all. Then if we encode each $q_i(x)$ by a sequence of 0s and 1s according as $q_i(x)$ does or
doesn't divide $t(x)^{(p^d-1)/2} - 1$ for the successive t's tried, we obtain a random binary
trie with r lieves (see Section 6.3). The cost associated with an internal node of this
trie, having m lieves as descendants, is $O(m^2(\log p)^3)$; and the solution to the recurrence
$A_n = \binom{n}{2} + 2^{1-n}\sum_k \binom{n}{k} A_k$ is $A_n = 2\binom{n}{2}$ by exercise 5.2.2-36. Hence the sum of costs
in the given random trie (representing the expected time to factor g(x) completely)
is $O(r^2(\log p)^3)$ under this plausible assumption. The plausible assumption becomes
rigorously true if we choose t(x) at random of degree < rd instead of restricting it to
degree < 2d.


30. Let $T(x) = x + x^p + \cdots + x^{p^{d-1}}$ be the trace of x and let $v(x) = T(t(x)) \bmod q(x)$.
Since $t(x)^{p^d} = t(x)$ in the field of polynomial remainders modulo q(x), we have $v(x)^p =
v(x)$ in that field; in other words, v(x) is one of the p roots of the equation $y^p - y = 0$.
Hence v(x) is an integer.

It follows that $\prod_{s=0}^{p-1} \gcd(g_d(x), T(t(x)) - s) = g_d(x)$. In particular, when p = 2 we
can argue as in exercise 29 that $\gcd(g_d(x), T(t(x)))$ will be a proper factor of $g_d(x)$ with
probability $\ge \frac12$ when $g_d(x)$ has at least two irreducible factors and t(x) is a random
binary polynomial of degree < 2d.

[Note that T(t(x)) mod g(x) can be computed by starting with $u(x) \leftarrow t(x)$ and
setting $u(x) \leftarrow (t(x) + u(x)^p) \bmod g(x)$ repeatedly, d − 1 times. The method of this
exercise is based on the polynomial factorization $x^{p^d} - x = \prod_{s=0}^{p-1} (T(x) - s)$, which
holds for any p, while formula (21) is based on the polynomial factorization $x^{p^d} - x =
x(x^{(p^d-1)/2} + 1)(x^{(p^d-1)/2} - 1)$ for odd p.]

The trace was introduced by Richard Dedekind, Abhandlungen der Königl.
Gesellschaft der Wissenschaften zu Göttingen 29 (1882), 1-56. The technique of calculating
gcd(f(x), T(x) − s) to find factors of f(x) can be traced to A. Arwin, Arkiv för Mat.,
Astr. och Fys. 14, 7 (1918), 1-46; but his method was incomplete because he did not
consider T(t(x)) for $t(x) \ne x$. A complete factorization algorithm using traces was
devised later by R. J. McEliece, Math. Comp. 23 (1969), 861-867; see also von zur 
Gathen and Shoup, Computational Complexity 2 (1992), 187-224, Algorithm 3.6, for 
asymptotically fast results. 

Henri Cohen has observed that for p = 2 it suffices to test at most d special
cases $t(x) = x, x^3, \ldots, x^{2d-1}$ when applying this method. One of these choices of
t(x) is guaranteed to split $g_d(x)$ whenever $g_d$ is reducible, because we can obtain the
effects of all polynomials t(x) of degree < 2d from these special cases using the facts that
$T(t(x)^p) \equiv T(t(x))$ and $T(u(x) + t(x)) \equiv T(u(x)) + T(t(x))$ (modulo $g_d(x)$). [A Course
in Computational Algebraic Number Theory (Springer, 1993), Algorithm 3.4.8.]
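For p = 2 the trace method is short enough to sketch with polynomials over GF(2) stored as Python integers (bit j is the coefficient of $x^j$). All names below are hypothetical, and the example $g_3(x) = (x^3 + x + 1)(x^3 + x^2 + 1)$, encoded as 127, was chosen only for illustration.

```python
def gf2_mod(a, m):
    """Remainder of a(x) divided by m(x) over GF(2)."""
    dm = m.bit_length()
    while a.bit_length() >= dm:
        a ^= m << (a.bit_length() - dm)
    return a

def gf2_mulmod(a, b, m):
    """Product a(x)*b(x) mod m(x) over GF(2)."""
    result = 0
    a = gf2_mod(a, m)
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a = gf2_mod(a << 1, m)
    return result

def gf2_gcd(a, b):
    while b:
        a, b = b, gf2_mod(a, b)
    return a

def trace(t, g, d):
    """T(t) = t + t^2 + ... + t^(2^(d-1)) mod g, via u <- t + u^2 done d-1 times."""
    t = gf2_mod(t, g)
    u = t
    for _ in range(d - 1):
        u = t ^ gf2_mulmod(u, u, g)
    return u

def split_by_trace(g, d):
    """Try t(x) = x, x^3, ..., x^(2d-1); return a proper factor of g or None."""
    for k in range(1, 2 * d, 2):
        f = gf2_gcd(g, trace(1 << k, g, d))
        if f not in (1, g):
            return f
    return None
```

With g = 127 and d = 3, the very first choice t(x) = x already works: $T(x) = x^4 + x^2 + x = x(x^3 + x + 1)$, so the gcd is $x^3 + x + 1$, encoded as 11.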




31. If $\alpha$ is an element of the field of $p^d$ elements, let $d(\alpha)$ be the degree of $\alpha$, namely
the smallest exponent e such that $\alpha^{p^e} = \alpha$. Then consider the polynomial

$$P_\alpha(x) = (x - \alpha)(x - \alpha^p)\ldots(x - \alpha^{p^{d-1}}) = q_\alpha(x)^{d/d(\alpha)},$$

where $q_\alpha(x)$ is an irreducible polynomial of degree $d(\alpha)$. As $\alpha$ runs through all elements
of the field, the corresponding $q_\alpha(x)$ runs through every irreducible polynomial of
degree e dividing d, where every such irreducible occurs exactly e times. We have
$(x + t)^{(p^d-1)/2} \bmod q_\alpha(x) = 1$ if and only if $(\alpha + t)^{(p^d-1)/2} = 1$ in the field. If t is an
integer, we have $d(\alpha + t) = d(\alpha)$, hence n(p, d) is $d^{-1}$ times the number of elements $\alpha$
of degree d such that $\alpha^{(p^d-1)/2} = 1$. Similarly, if $t_1 \ne t_2$ we want to count the number
of elements of degree d such that $(\alpha + t_1)^{(p^d-1)/2} = (\alpha + t_2)^{(p^d-1)/2}$, or equivalently
$((\alpha + t_1)/(\alpha + t_2))^{(p^d-1)/2} = 1$. As $\alpha$ runs through all the elements of degree d, so does
the quantity $(\alpha + t_1)/(\alpha + t_2) = 1 + (t_1 - t_2)/(\alpha + t_2)$.

[We have $n(p, d) = \frac14 d^{-1} \sum_{c \backslash d} (3 + (-1)^c)\,\mu(c)\,(p^{d/c} - 1)$, which is about half the
total number of irreducibles; exactly half, in fact, when d is odd. This proves that
$\gcd(g_d(x), (x + t)^{(p^d-1)/2} - 1)$ has a good chance of finding factors of $g_d(x)$ when t is
fixed and $g_d(x)$ is chosen at random; but a randomized algorithm is supposed to work
with guaranteed probability for fixed $g_d(x)$ and random t, as in exercise 29.]


32. (a) Clearly $x^n - 1 = \prod_{d \backslash n} \Psi_d(x)$, since every complex nth root of unity is a
primitive dth root for some unique d\n. The second identity follows from the first; and
$\Psi_n(x)$ has integer coefficients since it is expressed in terms of products and quotients
of monic polynomials with integer coefficients.

(b) The condition in the hint suffices to prove that $f(x) = \Psi_n(x)$, so we shall take
the hint. When p does not divide n, we have $x^n - 1 \perp nx^{n-1}$ modulo p, hence $x^n - 1$
is squarefree modulo p. Given f(x) and $\zeta$ as in the hint, let g(x) be the irreducible
factor of $\Psi_n(x)$ such that $g(\zeta^p) = 0$. If $g(x) \ne f(x)$ then f(x) and g(x) are distinct
factors of $\Psi_n(x)$, hence they are distinct factors of $x^n - 1$, hence they have no irreducible
factors in common modulo p. However, $\zeta$ is a root of $g(x^p)$, so $\gcd(f(x), g(x^p)) \ne 1$
over the integers, hence f(x) is a divisor of $g(x^p)$. By (5), f(x) is a divisor of $g(x)^p$,
modulo p, contradicting the assumption that f(x) and g(x) have no irreducible factors
in common. Therefore f(x) = g(x). [The irreducibility of $\Psi_n(x)$ was first proved for
prime n by C. F. Gauss in Disquisitiones Arithmeticæ (Leipzig: 1801), Art. 341, and
for general n by L. Kronecker, J. de Math. Pures et Appliquées 19 (1854), 177-192.]

(c) $\Psi_1(x) = x - 1$; and when p is prime, $\Psi_p(x) = 1 + x + \cdots + x^{p-1}$. If n > 1
is odd, it is not difficult to prove that $\Psi_{2n}(x) = \Psi_n(-x)$. If p divides n, the second
identity in (a) shows that $\Psi_{pn}(x) = \Psi_n(x^p)$. If p does not divide n, we have $\Psi_{pn}(x) =
\Psi_n(x^p)/\Psi_n(x)$. For nonprime $n \le 15$ we have $\Psi_4(x) = x^2 + 1$, $\Psi_6(x) = x^2 - x + 1$,
$\Psi_8(x) = x^4 + 1$, $\Psi_9(x) = x^6 + x^3 + 1$, $\Psi_{10}(x) = x^4 - x^3 + x^2 - x + 1$, $\Psi_{12}(x) = x^4 - x^2 + 1$,
$\Psi_{14}(x) = x^6 - x^5 + x^4 - x^3 + x^2 - x + 1$, $\Psi_{15}(x) = x^8 - x^7 + x^5 - x^4 + x^3 - x + 1$.
[The formula $\Psi_{pq}(x) = (1 + x^p + \cdots + x^{(q-1)p})(x - 1)/(x^q - 1)$ can be used to show
that $\Psi_{pq}(x)$ has all coefficients ±1 or 0 when p and q are prime; but the coefficients of
$\Psi_{pqr}(x)$ can be arbitrarily large.]
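The first identity in (a) gives a direct, if slow, way to tabulate $\Psi_n(x)$: divide $x^n - 1$ by $\Psi_d(x)$ for every proper divisor d of n. The Python sketch below uses hypothetical function names and stores coefficients with the constant term first.

```python
from functools import lru_cache

def poly_div_exact(a, b):
    """Exact quotient a(x)/b(x) for integer polynomials, b monic."""
    a = list(a)
    quot = [0] * (len(a) - len(b) + 1)
    for i in range(len(a) - len(b), -1, -1):
        c = a[i + len(b) - 1]
        quot[i] = c
        for j, y in enumerate(b):
            a[i + j] -= c * y
    assert not any(a), "division was not exact"
    return quot

@lru_cache(maxsize=None)
def cyclotomic(n):
    """Coefficients of the nth cyclotomic polynomial, constant term first."""
    poly = [-1] + [0] * (n - 1) + [1]          # x^n - 1
    for d in range(1, n):
        if n % d == 0:
            poly = poly_div_exact(poly, cyclotomic(d))
    return tuple(poly)
```

For example, `cyclotomic(15)` yields the coefficients of $x^8 - x^7 + x^5 - x^4 + x^3 - x + 1$, and `cyclotomic(105)` (a product of three odd primes) is the first case containing a coefficient of absolute value 2.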


33. False; we lose all $p_j$ with $e_j$ divisible by p. True if p > deg(u). [See exercise 36.]


34. [D. Y. Y. Yun, Proc. ACM Symp. Symbolic and Algebraic Comp. (1976), 26-35.]
Set $(t(x), v_1(x), w_1(x)) \leftarrow \mathrm{GCD}(u(x), u'(x))$. If t(x) = 1, set $e \leftarrow 1$; otherwise set
$(u_i(x), v_{i+1}(x), w_{i+1}(x)) \leftarrow \mathrm{GCD}(v_i(x), w_i(x) - v_i'(x))$ for i = 1, 2, ..., e − 1, until
finding $w_e(x) - v_e'(x) = 0$. Finally set $u_e(x) \leftarrow v_e(x)$.

To prove the validity of this algorithm, we observe that it computes the
polynomials $t(x) = u_2(x)\,u_3(x)^2\,u_4(x)^3 \ldots$, $v_i(x) = u_i(x)\,u_{i+1}(x)\,u_{i+2}(x) \ldots$, and

$$w_i(x) = u_i'(x)\,u_{i+1}(x)\,u_{i+2}(x)\ldots + 2u_i(x)\,u_{i+1}'(x)\,u_{i+2}(x)\ldots + 3u_i(x)\,u_{i+1}(x)\,u_{i+2}'(x)\ldots + \cdots.$$

We have $t(x) \perp w_1(x)$, since an irreducible factor of $u_i(x)$ divides all but the ith
term of $w_1(x)$, and it is relatively prime to that term. Furthermore we clearly have
$u_i(x) \perp v_{i+1}(x)$.
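The procedure above can be sketched in Python over the rationals. Here `gcd_triple` plays the role of GCD, returning the monic gcd g together with the two cofactors a/g and b/g; all names are hypothetical, coefficients are listed with the constant term first, and the zero polynomial is the empty list.

```python
from fractions import Fraction

def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a

def deriv(a):
    return trim([i * c for i, c in enumerate(a)][1:])

def sub(a, b):
    n = max(len(a), len(b))
    return trim([(a[i] if i < len(a) else 0) - (b[i] if i < len(b) else 0)
                 for i in range(n)])

def divmod_poly(a, b):
    a = list(a)
    quot = [Fraction(0)] * max(len(a) - len(b) + 1, 0)
    for i in range(len(a) - len(b), -1, -1):
        c = a[i + len(b) - 1] / b[-1]
        quot[i] = c
        for j, y in enumerate(b):
            a[i + j] -= c * y
    return quot, trim(a)

def gcd_triple(a, b):
    """Return (g, a/g, b/g), where g is the monic gcd of a and b."""
    g, h = a, b
    while h:
        g, h = h, divmod_poly(g, h)[1]
    g = [c / g[-1] for c in g]
    return g, divmod_poly(a, g)[0], divmod_poly(b, g)[0]

def yun_squarefree(u):
    """Return [u_1, u_2, ..., u_e] with u = u_1 * u_2^2 * ... * u_e^e."""
    u = [Fraction(c) for c in u]
    t, v, w = gcd_triple(u, deriv(u))
    factors = []
    while True:
        d = sub(w, deriv(v))          # w_i - v_i'
        if not d:                     # w_e - v_e' = 0: finish with u_e = v_e
            factors.append(v)
            return factors
        u_i, v, w = gcd_triple(v, d)
        factors.append(u_i)
```

For example, $x(x + 1)^2(x + 2)^3 = x^6 + 8x^5 + 25x^4 + 38x^3 + 28x^2 + 8x$ is decomposed into x, x + 1, and x + 2.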
[Although exercise 2(b) proves that most polynomials are squarefree, nonsquarefree
polynomials actually occur often in practice; hence this method turns out to be quite 
important. See Paul S. Wang and Barry M. Trager, SICOMP 8 (1979), 300-305, 
for suggestions on how to improve the efficiency. Squarefree factorization modulo p 
is discussed by Bach and Shallit, Algorithmic Number Theory 1 (MIT Press, 1996), 
answer to exercise 7.27.] 


35. We have $w_j(x) = \gcd(u_j(x), v_j^*(x)) \cdot \gcd(u_{j+1}^*(x), v_j(x))$, where
$u_j^*(x) = u_j(x)\,u_{j+1}(x) \ldots$ and $v_j^*(x) = v_j(x)\,v_{j+1}(x) \ldots$.

[Yun notes that the running time for squarefree factorization by the method of
exercise 34 is at most about twice the running time to calculate gcd(u(x), u'(x)).
Furthermore if we are given an arbitrary method for discovering squarefree factorization, the
method of this exercise leads to a gcd procedure. (When u(x) and v(x) are squarefree,
their gcd is simply $w_2(x)$ where $w(x) = u(x)v(x) = w_1(x)\,w_2(x)^2$; the polynomials
$u_j(x)$, $v_j(x)$, $u_j^*(x)$, and $v_j^*(x)$ are all squarefree.) Hence the problem of converting
a primitive polynomial of degree n to its squarefree representation is computationally 
equivalent to the problem of calculating the gcd of two nth degree polynomials, in the 
sense of asymptotic worst-case running time.] 


36. Let $U_j(x)$ be the value computed for "$u_j(x)$" by the procedure of exercise 34. If
$\deg(U_1) + 2\deg(U_2) + \cdots = \deg(u)$, then $u_j(x) = U_j(x)$ for all j. But in general we will
have e < p and $U_j(x) = \prod_{k\ge0} u_{j+pk}(x)$ for $1 \le j < p$. To separate these factors further,
we can calculate $t(x)/(U_2(x)\,U_3(x)^2 \ldots U_{p-1}(x)^{p-2}) = \prod_{j\ge p} u_j(x)^{p\lfloor j/p\rfloor} = z(x^p)$. After
recursively finding the squarefree representation of $z(x) = (z_1(x), z_2(x), \ldots)$, we will
have $z_k(x) = \prod_{0\le j<p} u_{j+pk}(x)$, so we can calculate the individual $u_i(x)$ by the formula
$\gcd(U_j(x), z_k(x)) = u_{j+pk}(x)$ for $1 \le j < p$. The polynomial $u_{pk}(x)$ will be left when
the other factors of $z_k(x)$ have been removed.

Note: This procedure is fairly simple but the program is lengthy. If one's goal is
to have a short program for complete factorization modulo p, rather than an extremely
efficient one, it is probably easiest to modify the distinct-degree factorization routine
so that it casts out $\gcd(x^{p^d} - x, u(x))$ several times for the same value of d until the
gcd is 1. In this case you needn't begin by calculating gcd(u(x), u'(x)) and removing
multiple factors as suggested in the text, since the polynomial $x^{p^d} - x$ is squarefree.


37. The exact probability is $\prod_{j\ge1} (a_{jp}/p^j)^{k_j}/k_j!$, where $k_j$ is the number of $d_i$ that are
equal to j. Since $a_{jp}/p^j \approx 1/j$ by exercise 4, we get the formula of exercise 1.3.3-21.

Notes: This exercise says that if we fix the prime p and let the polynomial u(x)
be random, it will have a certain probability of splitting in a given way modulo p.
A much harder problem is to fix the polynomial u(x) and to let p be "random"; it
turns out that the same asymptotic result holds for almost all u(x). G. Frobenius
proved in 1880 that the integer polynomial u(x) splits modulo p into factors of degrees
$d_1, \ldots, d_r$, when p is a large prime chosen at random, with probability equal to the
number of permutations in the Galois group G of u(x) having cycle lengths $\{d_1, \ldots, d_r\}$
divided by the total number of permutations in G. [If u(x) has rational coefficients and
distinct roots $\xi_1, \ldots, \xi_n$ over the complex numbers, its Galois group is the (unique)
group G of permutations such that the polynomial $\prod_{p(1)\ldots p(n)\in G} (z + \xi_{p(1)}y_1 + \cdots +
\xi_{p(n)}y_n) = U(z, y_1, \ldots, y_n)$ has rational coefficients and is irreducible over the rationals;
see G. Frobenius, Sitzungsberichte Königl. preuß. Akad. Wiss. (Berlin: 1896), 689-703.
The linear mapping $x \mapsto x^p$ is traditionally called the Frobenius automorphism because
of this famous paper.] Furthermore B. L. van der Waerden proved in 1934 that almost
all polynomials of degree n have the set of all n! permutations as their Galois group
[Math. Annalen 109 (1934), 13-16]. Therefore almost all fixed irreducible polynomials
u(x) will factor as we might expect them to, with respect to randomly chosen large
primes p. See also N. Chebotarev, Math. Annalen 95 (1926), for a generalization of
Frobenius's theorem to conjugacy classes of the Galois group.


38. The conditions imply that when |z| = 1 we have either $|u_{n-2}z^{n-2} + \cdots + u_0| <
|u_{n-1}| - 1 \le |z^n + u_{n-1}z^{n-1}|$ or $|u_{n-3}z^{n-3} + \cdots + u_0| < u_{n-2} - 1 \le |z^n + u_{n-2}z^{n-2}|$.
Therefore by Rouché's theorem [J. École Polytechnique 21, 37 (1858), 1-34], u(z) has
at least n − 1 or n − 2 roots inside the circle |z| = 1. If u(z) is reducible, it can be
written v(z)w(z) where v and w are monic integer polynomials. The products of the
roots of v and of w are nonzero integers, so each factor has a root of absolute value $\ge 1$.
Hence the only possibility is that v and w both have exactly one such root and that
$u_{n-1} = 0$. These roots must be real, since the complex conjugates are roots; hence
u(z) has a real root $z_0$ with $|z_0| \ge 1$. But this cannot be, for if $r = 1/z_0$ we have
$0 = |1 + u_{n-2}r^2 + \cdots + u_0 r^n| \ge 1 + u_{n-2}r^2 - |u_{n-3}|\,r^3 - \cdots - |u_0|\,r^n > 1$. [O. Perron,
Crelle 132 (1907), 288-307; for generalizations, see A. Brauer, Amer. J. Math. 70
(1948), 423-432, 73 (1951), 717-720.]


39. First we prove the hint: Let $u(x) = a(x - \alpha_1)\ldots(x - \alpha_n)$ have integer coefficients.
The resultant of u(x) with the polynomial y − t(x) is a determinant, so it is a polynomial
$r_t(y) = a^{\deg(t)}(y - t(\alpha_1))\ldots(y - t(\alpha_n))$ with integer coefficients (see exercise 4.6.1-12).
If u(x) divides v(t(x)) then $v(t(\alpha_1)) = 0$, hence $r_t(y)$ has a factor in common with
v(y). So if v is irreducible, we have $\deg(u) = \deg(r_t) \ge \deg(v)$.

Given an irreducible polynomial u(x) for which a short proof of irreducibility is
desired, we may assume that u(x) is monic, by exercise 18, and that $\deg(u) \ge 3$. The
idea is to show the existence of a polynomial t(x) such that $v(y) = r_t(y)$ is irreducible
by the criterion of exercise 38. Then all factors of u(x) divide the polynomial v(t(x)),
and this will prove that u(x) is irreducible. The proof will be succinct if the coefficients
of t(x) are suitably small.

The polynomial $v(y) = (y - \beta_1)\ldots(y - \beta_n)$ can be shown to satisfy the criterion of
exercise 38 if $n \ge 3$ and $\beta_1 \ldots \beta_n \ne 0$, and if the following "smallness condition" holds:
$|\beta_j| \le 1/(4n)$ except when j = n or when $\beta_j = \bar\beta_n$ and $|\Re\beta_j| \le 1/(4n)$. The calculations
are straightforward, using the fact that $|v_0| + \cdots + |v_n| \le (1 + |\beta_1|)\ldots(1 + |\beta_n|)$.

Let $\alpha_1, \ldots, \alpha_r$ be real and $\alpha_{r+1}, \ldots, \alpha_{r+s}$ be complex, where n = r + 2s and
$\alpha_{r+s+j} = \bar\alpha_{r+j}$ for $1 \le j \le s$. Consider the linear expressions $S_j(a_0, \ldots, a_{n-1})$ defined
to be $\Re\bigl(\sum_{i=0}^{n-1} a_i\alpha_j^i\bigr)$ for $1 \le j \le r + s$ and $\Im\bigl(\sum_{i=0}^{n-1} a_i\alpha_j^i\bigr)$ for $r + s < j \le n$. If
$0 \le a_i < b$ and $B = \bigl\lceil \max_j \sum_{i=0}^{n-1} |\alpha_j|^i \bigr\rceil$, we have $|S_j(a_0, \ldots, a_{n-1})| < bB$. Thus if we
choose $b > (16nB)^{n-1}$, there must be distinct vectors $(a_0, \ldots, a_{n-1})$ and $(a_0', \ldots, a_{n-1}')$
such that $\lfloor 8nS_j(a_0, \ldots, a_{n-1}) \rfloor = \lfloor 8nS_j(a_0', \ldots, a_{n-1}') \rfloor$ for $1 \le j < n$, since there
are $b^n$ vectors but at most $(16nbB)^{n-1} < b^n$ possible (n − 1)-tuples of values. Let
$t(x) = (a_0 - a_0') + \cdots + (a_{n-1} - a_{n-1}')x^{n-1}$ and $\beta_j = t(\alpha_j)$. Then the smallness condition
is satisfied. Furthermore $\beta_j \ne 0$; otherwise t(x) would divide u(x). [J. Algorithms 2
(1981), 385-392.]


40. Given a candidate factor $v(x) = x^d + a_{d-1}x^{d-1} + \cdots + a_0$, change each $a_j$ to a
rational fraction (modulo $p^e$), with numerators and denominators $\le B$. Then multiply
by the least common denominator, and see if the resulting polynomial divides u(x)
over the integers. If not, no factor of u(x) with coefficients bounded by B is congruent
modulo $p^e$ to a multiple of v(x).

41. David Boyd notes that $4x^8 + 4x^6 + x^4 + 4x^2 + 4 = (2x^4 + 4x^3 + 5x^2 + 4x + 2) \times
(2x^4 - 4x^3 + 5x^2 - 4x + 2)$, and he has found examples of higher degree to prove that
c must be > 2 if it exists.


SECTION 4.6.3 
1. $x^m$, where $m = 2^{\lfloor \lg n \rfloor}$ is the highest power of 2 less than or equal to n.


2. Assume that x is input in register A, and n in location NN; the output is in
register X.

01  A1  ENTX  1      1         A1. Initialize.
02      STX   Y      1         Y ← 1.
03      STA   Z      1         Z ← x.
04      LDA   NN     1         N ← n.
05      JAP   2F     1         To A2.
06      JMP   DONE   0         Otherwise the answer is 1.
07  5H  SRB   1      L+1−K
08      STA   N      L+1−K     N ← ⌊N/2⌋.
09  A5  LDA   Z      L         A5. Square Z.
10      MUL   Z      L
11      STX   Z      L         Z ← Z × Z mod w.
12  A2  LDA   N      L         A2. Halve N.
13  2H  JAE   5B     L+1       To A5 if N is even.
14      SRB   1      K
15  A4  JAZ   4F     K         Jump if N = 1.
16      STA   N      K−1       N ← ⌊N/2⌋.
17  A3  LDA   Z      K−1       A3. Multiply Y by Z.
18      MUL   Y      K−1
19      STX   Y      K−1       Y ← Z × Y mod w.
20      JMP   A5     K−1       To A5.
21  4H  LDA   Z      1
22      MUL   Y      1         Do the final multiplication. ∎

The running time is 21L + 16K + 8, where $L = \lambda(n)$ is one less than the number of
bits in the binary representation of n, and $K = \nu(n)$ is the number of 1-bits in that
representation.

For the serial program, we may assume that n is small enough to fit in an index 
register; otherwise serial exponentiation is out of the question. The following program 
leaves the output in register A: 


01  S1  LD1  NN     1        rI1 ← n.
02      STA  X      1        X ← x.
03      JMP  2F     1
04  1H  MUL  X      N-1      rA × X mod w
05      SLAX 5      N-1        → rA.
06  2H  DEC1 1      N        rI1 ← rI1 − 1.
07      J1P  1B     N        Multiply again if rI1 > 0. ▮


The running time for this program is 14N − 7; it is faster than the previous program
when n ≤ 7, slower when n ≥ 8.
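
The comparison between the two MIX programs can be replayed with a short Python sketch. The functions below mirror steps A1-A5 as annotated in the first listing and the multiply loop of the second; the function names are illustrative assumptions, and the two timing functions simply evaluate the formulas 21L + 16K + 8 and 14N − 7 quoted above rather than simulating MIX.

```python
def power_binary(x, n):
    """Right-to-left binary method (steps A1-A5) for x**n, n >= 0."""
    y, z = 1, x                  # A1: Y <- 1, Z <- x
    while n > 0:
        n, odd = divmod(n, 2)    # A2: halve N
        if odd:
            y = z * y            # A3: multiply Y by Z
            if n == 0:           # A4: nothing left to do
                break
        z = z * z                # A5: square Z
    return y


def power_serial(x, n):
    """Serial method: n - 1 successive multiplications by x, n >= 1."""
    result = x
    for _ in range(n - 1):
        result *= x
    return result


def mix_time_binary(n):
    """Evaluate 21L + 16K + 8 with L = lambda(n), K = nu(n)."""
    L = n.bit_length() - 1
    K = bin(n).count("1")
    return 21 * L + 16 * K + 8


def mix_time_serial(n):
    """Evaluate 14N - 7."""
    return 14 * n - 7
```

Tabulating both timing formulas for small n shows the crossover between n = 7 and n = 8.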
3. The sequences of exponents are: (a) 1, 2, 3, 6, 7, 14, 15, 30, 60, 120, 121, 242,
243, 486, 487, 974, 975 [16 multiplications]; (b) 1, 2, 3, 4, 8, 12, 24, 36, 72, 108,
216, 324, 325, 650, 975 [14 multiplications]; (c) 1, 2, 3, 6, 12, 15, 30, 60, 120, 240,
243, 486, 972, 975 [13 multiplications]; (d) 1, 2, 3, 6, 12, 15, 30, 60, 75, 150, 300,
600, 900, 975 [13 multiplications]. [The smallest possible number of multiplications is
12; this is obtainable by combining the factor method with the binary method, since
975 = 15 · (2^6 + 1).]
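
The four exponent sequences above can be checked mechanically. The following Python sketch tests the addition-chain property (each term after the first is the sum of two earlier terms) and counts steps. The 12-step chain listed last is one possible realization of 975 = 15 · (2^6 + 1), written out here as an assumption about how the factor and binary methods combine; it is not quoted from the text.

```python
def is_addition_chain(chain):
    """True if chain starts at 1 and each later term is a sum of two earlier terms."""
    if not chain or chain[0] != 1:
        return False
    for i in range(1, len(chain)):
        earlier = chain[:i]
        seen = set(earlier)
        if not any(chain[i] - a in seen for a in earlier):
            return False
    return True


CHAINS_975 = {
    "a": [1, 2, 3, 6, 7, 14, 15, 30, 60, 120, 121, 242, 243, 486, 487, 974, 975],
    "b": [1, 2, 3, 4, 8, 12, 24, 36, 72, 108, 216, 324, 325, 650, 975],
    "c": [1, 2, 3, 6, 12, 15, 30, 60, 120, 240, 243, 486, 972, 975],
    "d": [1, 2, 3, 6, 12, 15, 30, 60, 75, 150, 300, 600, 900, 975],
    # Assumed realization of 975 = 15 * (2**6 + 1): reach 15, double six times, add 15.
    "factor+binary": [1, 2, 3, 6, 12, 15, 30, 60, 120, 240, 480, 960, 975],
}

# Number of multiplications = number of terms after the initial 1.
STEPS = {name: len(chain) - 1 for name, chain in CHAINS_975.items()}
```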

4. (777777)_8 = 2^18 − 1.

5. T1. [Initialize.] Set LINKU[j] ← 0 for 0 ≤ j ≤ 2^r, and set k ← 0, LINKR[0] ← 1,
LINKR[1] ← 0.

T2. [Change level.] (Now level k of the tree has been linked together from left to
right, starting at LINKR[0].) If k = r, the algorithm terminates. Otherwise set
n ← LINKR[0], m ← 0.

T3. [Prepare for n.] (Now n is a node on level k, and m points to the rightmost
node currently on level k + 1.) Set q ← 0, s ← n.

T4. [Already in tree?] (Now s is a node in the path from the root to n.) If
LINKU[n + s] ≠ 0, go to T6 (the value n + s is already in the tree).

T5. [Insert below n.] If q = 0, set m′ ← n + s. Then set LINKR[n + s] ← q,
LINKU[n + s] ← n, q ← n + s.

T6. [Move up.] Set s ← LINKU[s]. If s ≠ 0, return to T4.

T7. [Attach group.] If q ≠ 0, set LINKR[m] ← q, m ← m′.

T8. [Move n.] Set n ← LINKR[n]. If n ≠ 0, return to T3.

T9. [End of level.] Set LINKR[m] ← 0, k ← k + 1, and return to T2. ▮
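
Steps T1-T9 build the tree level by level with two link arrays. A minimal Python sketch of the same construction is given below; it replaces LINKU by a dictionary `parent` and LINKR by an ordinary list per level, so it is an illustration under those assumptions rather than a transcription of the linked representation. The names `power_tree` and `path_to` are hypothetical.

```python
def power_tree(levels):
    """Build the power tree down to the given level; return {node: parent}."""
    parent = {1: 0}                 # plays the role of LINKU; 0 marks the root's parent
    current = [1]                   # nodes of level k, left to right
    for _ in range(levels):
        next_level = []
        for n in current:
            # Collect the path n, ..., 1 by following parent links (step T6).
            path = []
            s = n
            while s:
                path.append(s)
                s = parent[s]
            # Children are n + s for s on the path, in increasing order,
            # skipping values that already occur in the tree (step T4).
            for s in reversed(path):
                if n + s not in parent:
                    parent[n + s] = n
                    next_level.append(n + s)
        current = next_level
    return parent


def path_to(parent, n):
    """Return the path from the root 1 down to n."""
    path = []
    while n:
        path.append(n)
        n = parent[n]
    return path[::-1]
```

For instance, `path_to(power_tree(6), 23)` gives the six-step chain that the tree assigns to the exponent 23.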


6. Prove by induction that the path to the number 2^{e_0} + 2^{e_1} + ··· + 2^{e_t}, if e_0 > e_1 >
··· > e_t ≥ 0, is 1, 2, 2^2, ..., 2^{e_0}, 2^{e_0} + 2^{e_1}, ..., 2^{e_0} + 2^{e_1} + ··· + 2^{e_t}; furthermore, the
sequences of exponents on each level are in decreasing lexicographic order.


7. The binary and factor methods require one more step to compute x^{2n} than x^n;
the power tree method requires at most one more step. Hence (a) 15 · 2^k; (b) 33 · 2^k;
(c) 23 · 2^k; k = 0, 1, 2, 3, ....

8. The power tree always includes the node 2m at one level below m, unless it occurs 
at the same level or an earlier level; and it always includes the node 2m + 1 at one 
level below 2m, unless it occurs at the same level or an earlier level. [It is not true that 
2m is a child of m in the power tree for all m; the smallest example where this fails 
is m = 2138, which appears on level 15, while 4276 appears elsewhere on level 16. In 
fact, 2m sometimes occurs on the same level as m; the smallest example is m = 6029.] 


9. Start with N ← n, Z ← x, and Y_q ← 1 for 1 ≤ q < m, q odd; in general we
will have x^n = Y_1 Y_3^3 Y_5^5 ... Y_{m−1}^{m−1} Z^N as the algorithm proceeds. Assuming that N > 0,
set k ← N mod m, N ← ⌊N/m⌋. Then if k = 0, set Z ← Z^m and repeat; otherwise
if k = 2^p q where q is odd, set Z ← Z^{2^p}, Y_q ← Y_q · Z, and if N > 0 set Z ← Z^{2^{e−p}}
and repeat. Finally set Y_k ← Y_k · Y_{k+2} for k = m − 3, m − 5, ..., 1; the answer is
Y_1 (Y_3 Y_5 ... Y_{m−1})^2. (About m/2 of the multiplications are by 1.)
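
The right-to-left m-ary procedure of answer 9 can be sketched in Python as follows, assuming m = 2^e. The function name is hypothetical, and Python's `**` with a power-of-two exponent stands in for the corresponding run of squarings, so the sketch illustrates the bookkeeping of Z, N, and the Y_q rather than the multiplication count.

```python
def power_mary_right_to_left(x, n, e):
    """Compute x**n by the right-to-left 2**e-ary method (n >= 0, e >= 1)."""
    m = 1 << e
    Y = {q: 1 for q in range(1, m, 2)}    # Y_q for odd q < m
    Z, N = x, n
    # Invariant: x**n == prod(Y[q]**q for odd q) * Z**N.
    while N > 0:
        N, k = divmod(N, m)
        if k == 0:
            Z = Z ** m                    # e squarings
        else:
            p = (k & -k).bit_length() - 1 # k = 2**p * q with q odd
            q = k >> p
            Z = Z ** (1 << p)             # p squarings
            Y[q] *= Z
            if N > 0:
                Z = Z ** (1 << (e - p))   # remaining e - p squarings
    # Final stage: turn Y_k into the product of Y_j over odd j >= k.
    for k in range(m - 3, 0, -2):
        Y[k] *= Y[k + 2]
    tail = 1
    for k in range(3, m, 2):
        tail *= Y[k]
    return Y[1] * tail * tail
```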


10. By using the “PARENT” representation discussed in Section 2.3.3: Make use of a 
table p[j], 1 ≤ j ≤ 100, such that p[1] = 0 and p[j] is the number of the node just
above j for j ≥ 2. (The fact that each node of this tree has degree at most two has no
effect on the efficiency of this representation; it just makes the tree look prettier as an 
illustration.) 
11. 1, 2, 3, 5, 10, 20, (23 or 40), 43; 1, 2, 4, 8, 9, 17, (26 or 34), 43; 1, 2, 4, 8, 9, 17,
34, (43 or 68), 77; 1, 2, 4, 5, 9, 18, 36, (41 or 72), 77. If either of the last two paths 
were in the tree we would have no possibility for n = 48, since the tree must contain 
either 1, 2, 3, 5 or 1, 2, 4, 8, 9. 


12. No such infinite tree can exist, since l(n) ≠ l*(n) for some n.


13. For Case 1, use a Type-1 chain followed by 2^{A+C} + 2^{B+C} + 2^A + 2^B; or use the
factor method. For Case 2, use a Type-2 chain followed by 2^{A+C+1} + 2^{B+C} + 2^A + 2^B.
For Case 3, use a Type-5 chain followed by addition of 2^A + 2^{A−1}, or use the factor
method. For Case 4, n = 135 · 2^D, so we may use the factor method.


14. (a) It is easy to verify that steps r − 1 and r − 2 are not both small, so let us
assume that step r − 1 is small and step r − 2 is not. If c = 1, then λ(a_{r−1}) = λ(a_{r−k}),
so k = 2; and since 4 ≤ ν(a_r) = ν(a_{r−1}) + ν(a_{r−k}) − 1 ≤ ν(a_{r−1}) + 1, we have
ν(a_{r−1}) ≥ 3, making r − 1 a star step (lest a_0, a_1, ..., a_{r−3}, a_{r−1} include only one
small step). Then a_{r−1} = a_{r−2} + a_{r−q} for some q, and if we replace a_{r−2}, a_{r−1},
a_r by a_{r−2}, 2a_{r−2}, 2a_{r−2} + a_{r−q} = a_r, we obtain another counterexample chain in
which step r is small; but this is impossible. On the other hand, if c ≥ 2, then
4 ≤ ν(a_r) ≤ ν(a_{r−1}) + ν(a_{r−k}) − 2 ≤ ν(a_{r−1}); hence ν(a_{r−1}) = 4, ν(a_{r−k}) = 2, and
c = 2. This leads readily to an impossible situation by a consideration of the six types
in the proof of Theorem B.

(b) If λ(a_{r−k}) < m − 1, we have c ≥ 3, so ν(a_{r−k}) + ν(a_{r−1}) ≥ 7 by (22);
therefore both ν(a_{r−k}) and ν(a_{r−1}) are ≥ 3. All small steps must be ≤ r − k, and
λ(a_{r−k}) = m − k + 1. If k ≥ 4, we must have c = 4, k = 4, ν(a_{r−1}) = ν(a_{r−4}) = 4;
thus a_{r−1} ≥ 2^m + 2^{m−1} + 2^{m−2}, and a_{r−1} must equal 2^m + 2^{m−1} + 2^{m−2} + 2^{m−3}; but
a_{r−4} ≥ (1/8)a_{r−1} now implies that a_{r−1} = 8a_{r−4}. Thus k = 3 and a_{r−1} > 2^m + 2^{m−1}.
Since a_{r−2} < 2^m and a_{r−3} < 2^{m−1}, step r − 1 must be a doubling; but step r − 2
is a nondoubling, since a_{r−1} ≠ 4a_{r−3}. Furthermore, since ν(a_{r−3}) ≥ 3, r − 3 is a
star step; and a_{r−2} = a_{r−3} + a_{r−5} would imply that a_{r−5} = 2^{m−2}, hence we must
have a_{r−2} = a_{r−3} + a_{r−4}. As in a similar case treated in the text, the only possibility
is now seen to be a_{r−4} = 2^{m−2} + 2^{m−3}, a_{r−3} = 2^{m−2} + 2^{m−3} + 2^{d+1} + 2^d, a_{r−1} =
2^m + 2^{m−1} + 2^{d+2} + 2^{d+1}, and even this possibility is impossible.


15. Achim Flammenkamp [Diplomarbeit in Mathematics (Bielefeld University, 1991),
Part 1] has shown that the numbers n with λ(n) + 3 = l(n) < l*(n) all have the form
2^A + 2^B + 2^C + 2^D + 2^E where A > B > C > D > E and B + E = C + D; moreover,
they are described precisely by not matching any of the following eight patterns where
le] < 1: 2442473 4 20 4 20-1 4 220+2-A 9A 4 94-1 4 90 4 gD 4 90+D+1-A 


9A 2B 92B-A+3 4 92B+2—-A =e JI BTOS2A, 94 4 9B a 92B-Ate om 2P 4 QB+D+e—A. 


24 4 2B 498-14 2P 4 9P-1 944 9B 4 9B-24 2P 4 2P- (45 B41), 24428 4 
9¢ L 22B € AB Cte A g449B 1 90 4 9BtCte-A | 92Cte—A 


16. l^B(n) = λ(n) + ν(n) − 1; so if n = 2^k, l^B(n)/λ(n) = 1, but if n = 2^{k+1} − 1,
l^B(n)/λ(n) = 2.


17. Let i_1 < ··· < i_t. Delete any intervals I_k that can be removed without affecting
the union I_1 ∪ ··· ∪ I_t. (The interval (j_k .. i_k] may be dropped out if either j_{k+1} ≤ j_k
or j_1 < j_2 < ··· and j_{k+1} ≤ i_{k−1}.) Now combine overlapping intervals (j_1 .. i_1], ...,
(j_d .. i_d] into an interval (j′ .. i′] = (j_1 .. i_d] and note that

a_{i′} < a_{j′}(1 + δ)^{i_1 − j_1 + ··· + i_d − j_d} ≤ a_{j′}(1 + δ)^{2(i′ − j′)},

since each point of (j′ .. i′] is covered at most twice in (j_1 .. i_1] ∪ ··· ∪ (j_d .. i_d].
18. Call f(m) a “nice” function if (log f(m))/m → 0 as m → ∞. A polynomial
in m is nice. The product of nice functions is nice. If g(m) → 0 and c is a positive
constant, then c^{m g(m)} is nice; also binom(2m, m g(m)) is nice, for by Stirling’s approximation this
is equivalent to saying that g(m) log(1/g(m)) → 0.

Now replace each term of the summation by the maximum term that is attained for
any s, t, v. The total number of terms is nice, and so are binom(m + s, t + v), binom(t + v, v) ≤ 2^{t+v}, and β^{2v},
because (t + v)/m → 0. Finally, binom((m + s)^2, t) ≤ (2m)^{2t}/t! < (4em^2/t)^t, where (4e)^t is nice.
Replacing t by its upper bound (1 − ε/2)m/λ(m) shows that (m^2/t)^t ≤ 2^{m(1−ε/2)} f(m),
where f(m) is nice. Hence the entire sum is less than α^m for large m if α = 2^{1−η},
where 0 < η < ε/2.


19. (a) M ∩ N, M ∪ N, M ⊎ N, respectively; see Eqs. 4.5.2-(6), 4.5.2-(7).

(b) f(z)g(z), lcm(f(z), g(z)), gcd(f(z), g(z)). (For the same reasons as (a), because
the monic irreducible polynomials over the complex numbers are precisely the
polynomials z − ζ.)

(c) Commutative laws A ⊎ B = B ⊎ A, A ∪ B = B ∪ A, A ∩ B = B ∩ A. Associative
laws A ⊎ (B ⊎ C) = (A ⊎ B) ⊎ C, A ∪ (B ∪ C) = (A ∪ B) ∪ C, A ∩ (B ∩ C) = (A ∩ B) ∩ C.
Distributive laws A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C), A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C),
A ⊎ (B ∪ C) = (A ⊎ B) ∪ (A ⊎ C), A ⊎ (B ∩ C) = (A ⊎ B) ∩ (A ⊎ C). Idempotent
laws A ∪ A = A, A ∩ A = A. Absorption laws A ∪ (A ∩ B) = A, A ∩ (A ∪ B) = A,
A ∩ (A ⊎ B) = A, A ∪ (A ⊎ B) = A ⊎ B. Identity and zero laws ∅ ⊎ A = A, ∅ ∪ A = A,
∅ ∩ A = ∅, where ∅ is the empty multiset. Counting law A ⊎ B = (A ∪ B) ⊎ (A ∩ B).
Further properties analogous to those of sets come from the partial ordering defined by
the rule A ⊆ B if and only if A ∩ B = A (if and only if A ∪ B = B).
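
The three multiset operations map directly onto Python's `collections.Counter`, whose `+`, `|`, and `&` operators take the sum, maximum, and minimum of multiplicities. The sketch below spot-checks part (a) and several of the laws of part (c) on small examples; the helper names and the sample multisets are illustrative assumptions, and a check on examples is of course not a proof.

```python
from collections import Counter
from math import gcd


def prime_multiset(n):
    """Multiset of prime factors of n >= 1, by trial division."""
    factors = Counter()
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] += 1
            n //= p
        p += 1
    if n > 1:
        factors[n] += 1
    return factors


def is_submultiset(a, b):
    """A is contained in B if and only if the intersection of A and B equals A."""
    return (a & b) == a


A, B, C = Counter("aaabcc"), Counter("abbc"), Counter("aacd")

# Counter + is the multiset sum, | the union, & the intersection.
LAWS_HOLD = all([
    A + (B | C) == (A + B) | (A + C),      # distributive
    A + (B & C) == (A + B) & (A + C),      # distributive
    A | (B & C) == (A | B) & (A | C),      # distributive
    A & (A + B) == A,                      # absorption
    A | (A + B) == A + B,                  # absorption
    A + B == (A | B) + (A & B),            # counting law
    Counter() + A == A,                    # identity
    Counter() & A == Counter(),            # zero
])
```

With m = 360 and n = 84, the intersection, union, and sum of the two prime-factor multisets are the factorizations of gcd(m, n) = 12, lcm(m, n) = 2520, and mn = 30240.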

Notes: Other common applications of multisets are zeros and poles of meromor- 
phic functions, invariants of matrices in canonical form, invariants of finite Abelian 
groups, etc.; multisets can be useful in combinatorial counting arguments and in the 
development of measure theory. The terminal strings of a noncircular context-free 
grammar form a multiset that is a set if and only if the grammar is unambiguous. The 
author’s paper in Theoretical Studies in Computer Science, edited by J. D. Ullman 
(Academic Press, 1992), 1-13, discusses further applications to context-free grammars, 
and introduces the operation A ⋒ B, where each element that occurs a times in A and
b times in B occurs ab times in A ⋒ B.

Although multisets appear frequently in mathematics, they often must be treated 
rather clumsily because there is currently no standard way to treat sets with repeated 
elements. Several mathematicians have voiced their belief that the lack of adequate 
terminology and notation for this common concept has been a definite handicap to 
the development of mathematics. (A multiset is, of course, formally equivalent to a 
mapping from a set into the nonnegative integers, but this formal equivalence is of little 
or no practical value for creative mathematical reasoning.) The author discussed this 
matter with many people during the 1960s in an attempt to find a good remedy. Some 
of the names suggested for the concept were list, bunch, bag, heap, sample, weighted 
set, collection, suite; but these words either conflicted with present terminology, had an 
improper connotation, or were too much of a mouthful to say and to write conveniently. 
Finally it became clear that such an important concept deserves a name of its own, 
and the word “multiset” was coined by N. G. de Bruijn. His suggestion was widely 
adopted during the 1970s, and it is now standard terminology. 

The notation “A ⊎ B” has been selected by the author to avoid conflict with existing
notations and to stress the analogy with set union. It would not be as desirable to use
“A + B” for this purpose, since algebraists have found that A + B is a good notation for
the multiset {α + β | α ∈ A and β ∈ B}. If A is a multiset of nonnegative integers, let
G(z) = Σ_{n∈A} z^n be a generating function corresponding to A. (Generating functions
with nonnegative integer coefficients obviously correspond one-to-one with multisets of
nonnegative integers.) If G(z) corresponds to A and H(z) to B, then G(z) + H(z)
corresponds to A ⊎ B and G(z)H(z) corresponds to A + B. If we form “Dirichlet”
generating functions g(z) = Σ_{n∈A} 1/n^z, h(z) = Σ_{n∈B} 1/n^z, then the product g(z)h(z)
corresponds to the multiset product AB.

20. Type 3: (S_0, ..., S_r) = (M_00, ..., M_r0) = ({0}, ..., {A}, {A−1, A}, {A−1, A, A},
{A−1, A−1, A, A, A}, ..., {A+C−3, A+C−3, A+C−2, A+C−2, A+C−2}).

Type 5: (M_00, ..., M_r0) = ({0}, ..., {A}, {A−1, A}, ..., {A+C−1, A+C},
{A+C−1, A+C−1, A+C}, ..., {A+C+D−1, A+C+D−1, A+C+D});
(M_01, ..., M_r1) = (∅, ..., ∅, ∅, ..., ∅, {A+C−2}, ..., {A+C+D−2}), S_i = M_i0 ⊎ M_i1.
21. For example, let u = 2^{8q+5}, x = (2^{(q+1)u} − 1)/(2^u − 1) = 2^{qu} + ··· + 2^u + 1,
y = 2^{(q+1)u} + 1. Then xy = (2^{2(q+1)u} − 1)/(2^u − 1). If n = 2^{4(q+1)u} + xy, we have
l(n) ≤ 4(q + 1)u + q + 2 by Theorem F, but l*(n) = 4(q + 1)u + 2q + 2 by Theorem H.


22. Underline everything except the u — 1 insertions used in the calculation of x. 


23. Theorem G (everything underlined). 

24. Use the numbers (B^{a_i} − 1)/(B − 1), 0 ≤ i ≤ r, underlined when a_i is underlined;
and c_k B^{i−1}(B^{b_j} − 1)/(B − 1) for 0 ≤ j < t, 0 < i ≤ b_{j+1} − b_j, 1 ≤ k ≤ l^0(B), underlined
when c_k is underlined, where c_0, c_1, ... is a minimum length l^0-chain for B. To prove
the second inequality, let B = 2^m and use (3). (The second inequality is rarely, if ever,
an improvement on Theorem G.)


25. We may assume that d_k = 1. Use the rule R A_{k−1} ... A_1, where A_j = “XR” if
d_j = 1, A_j = “R” otherwise, and where “R” means take the square root, “X” means
multiply by x. For example, if y = (.1101101)_2, the rule is R R XR XR R XR XR.
(There exist binary square-root extraction algorithms suitable for computer hardware, 
requiring an execution time comparable to that of division; computers with such 
hardware could therefore calculate more general fractional powers using the technique 
in this exercise.) 
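
The rule of answer 25 reads the binary digits of y from right to left, taking a square root at each digit and multiplying by x first whenever the digit is 1. A floating-point Python sketch follows; the function name and the digit-list input format are assumptions made for illustration.

```python
import math


def fractional_power(x, digits):
    """Approximate x**y for y = (.d_1 d_2 ... d_k)_2 using square roots only.

    `digits` is the list [d_1, ..., d_k]; x must be positive.
    """
    digits = list(digits)
    while digits and digits[-1] == 0:   # drop trailing zeros so that d_k = 1
        digits.pop()
    if not digits:
        return 1.0                      # y = 0
    value = math.sqrt(x)                # leading "R": x**(.1)_2
    for d in reversed(digits[:-1]):     # A_{k-1}, ..., A_1
        if d:
            value *= x                  # "X"
        value = math.sqrt(value)        # "R"
    return value
```

For y = (.1101101)_2 = 109/128 the loop performs exactly the sequence R R XR XR R XR XR quoted above.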

26. If we know the pair (F_k, F_{k−1}), then we have (F_{k+1}, F_k) = (F_k + F_{k−1}, F_k) and
(F_{2k}, F_{2k−1}) = (F_k^2 + 2F_k F_{k−1}, F_k^2 + F_{k−1}^2); so a binary method can be used to calculate
(F_n, F_{n−1}), using O(log n) arithmetic operations. Perhaps better is to use the pair
of values (F_k, L_k), where L_k = F_{k−1} + F_{k+1} (see exercise 4.5.4-15); then we have
(F_{k+1}, L_{k+1}) = (½(F_k + L_k), ½(5F_k + L_k)), (F_{2k}, L_{2k}) = (F_k L_k, L_k^2 − 2(−1)^k).
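
The (F_k, L_k) recurrences above lend themselves to a left-to-right binary method. The Python sketch below scans the bits of n from the most significant end, doubling k at each bit and adding 1 when the bit is set; the function name is an assumption, and exact integer arithmetic is used so the halvings are exact.

```python
def fibonacci_lucas(n):
    """Return (F_n, L_n) for n >= 0 using O(log n) arithmetic operations."""
    f, l = 0, 2          # (F_0, L_0)
    k = 0                # index of the current pair
    for bit in bin(n)[2:]:
        # Doubling: (F_2k, L_2k) = (F_k L_k, L_k^2 - 2(-1)^k).
        f, l = f * l, l * l - (2 if k % 2 == 0 else -2)
        k *= 2
        if bit == "1":
            # Increment: (F_{k+1}, L_{k+1}) = ((F_k + L_k)/2, (5 F_k + L_k)/2).
            f, l = (f + l) // 2, (5 * f + l) // 2
            k += 1
    return f, l
```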

For the general linear recurrence x_n = a_1 x_{n−1} + ··· + a_d x_{n−d}, we can compute
x_n in O(d^3 log n) arithmetic operations by computing the nth power of an appropriate
d × d matrix. [This observation is due to J. C. P. Miller and D. J. Spencer Brown,
Comp. J. 9 (1966), 188-190.] In fact, as Richard Brent has observed, the number
of operations can be reduced to O(d^2 log n), or even to O(d log d log n) using exercise
4.7-6, if we first compute x^n mod (x^d − a_1 x^{d−1} − ··· − a_d) and then replace x^j by x_j.

27. The smallest n requiring s small steps must be c(r) for some r. For if c(r) < n <
c(r+1) we have l(n) − λ(n) ≤ r − λ(c(r)) = l(c(r)) − λ(c(r)). The answers for 1 ≤ s ≤ 8
are therefore 3, 7, 29, 127, 1903, 65131, 4169527, 994660991.

28. (a) x∇y = x | y | (x + y), where “|” is bitwise “or”, see exercise 4.6.2-26; clearly
ν(x∇y) ≤ ν(x|y) + ν(x&y) = ν(x) + ν(y). (b) Note first that A_{i−1}/2^{d_{i−1}} ⊆ A_i/2^{d_i} for
1 ≤ i ≤ r. Secondly, note that d_j = d_{i−1} in a nondoubling; for otherwise a_{i−1} ≥ 2a_j ≥
a_j + a_k = a_i. Hence A_j ⊆ A_{i−1} and A_k ⊆ A_{i−1}/2^{d_j−d_k}. (c) An easy induction on i,
except that close steps need closer attention. Let us say that m has property P(α) if
the 1s in its binary representation all appear in consecutive blocks of ≥ α in a row. If
m and m′ have P(α), so does m∇m′; if m has P(α) then ρ(m) has P(α + δ). Hence
B_i has P(1 + δc_i). Finally if m has P(α) then ν(ρ(m)) ≤ (α + δ)ν(m)/α; for ν(m) =
ν_1 + ··· + ν_q, where each block size ν_j is ≥ α, hence ν(ρ(m)) ≤ (ν_1 + δ) + ··· + (ν_q + δ) ≤
(1 + δ/α)ν_1 + ··· + (1 + δ/α)ν_q. (d) Let f = b_r + c_r be the number of nondoublings and s
the number of small steps. If f ≥ 3.271 lg ν(n) we have s ≥ lg ν(n) as desired, by (16).
Otherwise we have a_i ≤ (1 + 2^{−δ})^{b_i} 2^{c_i+d_i} for 0 ≤ i ≤ r, hence n ≤ ((1 + 2^{−δ})/2)^{b_r} 2^r,
and r ≥ lg n + b_r − b_r lg(1 + 2^{−δ}) ≥ lg n + lg ν(n) − lg(1 + δc_r) − b_r lg(1 + 2^{−δ}). Let
δ = ⌈lg(f + 1)⌉; then ln(1 + 2^{−δ}) ≤ ln(1 + 1/(f + 1)) < 1/(f + 1) ≤ δ/(1 + δf), and it
follows that lg(1 + δx) + (f − x) lg(1 + 2^{−δ}) ≤ lg(1 + δf) for 0 ≤ x ≤ f. Hence finally
l(n) ≥ lg n + lg ν(n) − lg(1 + (3.271 lg ν(n)) ⌈lg(1 + 3.271 lg ν(n))⌉). [Theoretical Comp.
Sci. 1 (1975), 1-12.]

29. [Canadian J. Math. 21 (1969), 675-683. Schönhage refined the method of
exercise 28 to prove that l(n) ≥ lg n + lg ν(n) − 2.13. Can the remaining gap be closed?]


30. n = 31 is the smallest example; l(31) = 7, but 1, 2, 4, 8, 16, 32, 31 is an
addition-subtraction chain of length 6. [After proving Theorem E, Erdős stated that the same
result holds also for addition-subtraction chains. Schönhage has extended the lower
bound of exercise 28 to addition-subtraction chains, with ν(n) replaced by ν̄(n) as
defined in exercise 4.1-34. A generalized right-to-left binary method for exponentiation,
which uses λ(n) + ν̄(n) − 1 multiplications when both x and x^{−1} are given, can be based
on the representation α_n of that exercise.]


32. See Discrete Math. 23 (1978), 115-119. [This cost model corresponds to
multiplication of large numbers by a classical method like Algorithm 4.3.1M. Empirical
results with a more general model in which the cost is (a_j a_k)^{β/2} have been obtained
by D. P. McCarthy, Math. Comp. 46 (1986), 603-608; this model comes closer to the
“fast multiplication” methods of Section 4.3.3, when two n-bit numbers are multiplied
in O(n^β) steps, but the cost function a_j a_k^{β−1} would actually be more appropriate (see
exercise 4.3.3-13). H. Zantema has analyzed the analogous problem when the cost of
step i is a_j + a_k instead of a_j a_k; see J. Algorithms 12 (1991), 281-307. In this case
the optimum chains have total cost (5/2)n + O(n^{1/2}). Furthermore the optimum additive
cost when n is odd is at least (5/2)(n − 1), with equality if and only if n can be written as
a product of numbers of the form 2^k + 1.]

33. Eight; there are four ways to compute 39 = 12 + 12 + 12 + 3 and two ways to 
compute 79 = 39+ 39+1. 

34. The statement is true. The labels in the reduced graph of the binary chain are
⌊n/2^k⌋ for k = e_0, ..., 0; they are 1, 2, ..., 2^{e_0}, n in the dual graph. [Similarly, the
right-to-left m-ary method of exercise 9 is the dual of the left-to-right method.]


35. 2^t are equivalent to the binary chain; it would be 2^{t−1} if e_0 = e_1 + 1. The number
of chains equivalent to the scheme of Algorithm A is the number of ways to compute
the sum of t + 2 numbers of which two are identical. This is ½f_{t+1} + ½f_t, where f_m
is the number of ways to compute the sum of m + 1 distinct numbers. When we take
commutativity into account, we see that f_m is 2^{−m} times (m + 1)! times the number
of binary trees on m nodes, so f_m = (2m − 1)(2m − 3) ... 1.


36. First form the 2^m − m − 1 products x_1^{e_1} ... x_m^{e_m}, for all sequences of exponents such
that 0 ≤ e_k ≤ 1 and e_1 + ··· + e_m ≥ 2. Let n_k = (d_{kλ} ... d_{k1} d_{k0})_2; to complete the
calculation, take x_1^{d_{1λ}} ... x_m^{d_{mλ}}, then square and multiply by x_1^{d_{1i}} ... x_m^{d_{mi}}, for i = λ − 1,
..., 1, 0. [Straus showed in AMM 71 (1964), 807-808, that 2λ(n) may be replaced by
(1 + ε)λ(n) for any ε > 0, by generalizing this binary method to 2^k-ary as in Theorem D.]
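
The procedure of answer 36 can be sketched in Python as follows: a table holds the product of every subset of the x's, and a single left-to-right pass over the bit columns of the exponents squares the running result and multiplies in one table entry. The function name and the bitmask indexing of the table are assumptions made for illustration; the table also stores the empty and singleton products, which cost no multiplications.

```python
def multi_power(xs, ns):
    """Compute the product of xs[k] ** ns[k] with one shared squaring chain."""
    m = len(xs)
    # table[mask] = product of xs[k] over the bits k set in mask.
    table = [1] * (1 << m)
    for mask in range(1, 1 << m):
        low = mask & -mask                      # lowest set bit of mask
        table[mask] = table[mask ^ low] * xs[low.bit_length() - 1]
    top = max(ns).bit_length() if ns else 0
    result = 1
    for i in range(top - 1, -1, -1):            # bit columns, most significant first
        result *= result                        # square
        column = 0
        for k, n in enumerate(ns):
            column |= ((n >> i) & 1) << k       # bit i of each exponent
        result *= table[column]                 # multiply by the matching product
    return result
```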


37. (Solution by D. J. Bernstein.) Let n = n_m. First compute 2^e for 1 ≤ e ≤ λ(n),
then compute each n_j in λ(n)/λλ(n) + O(λ(n)λλλ(n)/λλ(n)^2) further steps by the
following variant of the 2^k-ary method, where k = ⌊lg lg n − 2 lg lg lg n⌋: For all odd
q < 2^k, compute y_q = Σ{2^{kt+e} | d_t = 2^e q} where n_j = (... d_1 d_0)_{2^k}, in at most ⌊(1/k) lg n⌋
steps; then use the method in the final stages of answer 9 to compute n_j = Σ q y_q with
at most 2^k − 1 further additions.

[A generalization of Theorem E gives the corresponding lower bound. Reference:
SICOMP 5 (1976), 100-103.]


38. The following construction due to D. J. Newman provides the best upper bound
currently known: Let k = p_1 ... p_r be the product of the first r primes. Compute k and
all quadratic residues mod k in O(2^{−r} k log k) steps (because there are approximately
2^{−r} k quadratic residues). Also compute all multiples of k that are ≤ m^2, in about
m^2/k further steps. Now m additions suffice to compute 1^2, 2^2, ..., m^2. We have k =
exp(p_r + O(p_r/(log p_r)^{1000})) where p_r is given by the formula in the answer to exercise
4.5.4-36; see, for example, Greene and Knuth, Math. for the Analysis of Algorithms
(Boston: Birkhäuser, 1981), §4.1.6. So by choosing

r = ⌊(1 + ½ ln 2/lg lg m) ln m/ln ln m⌋

it follows that l(1^2, ..., m^2) = m + O(m · exp(−(½ ln 2 − ε) ln m/ln ln m)).
On the other hand, D. Dobkin and R. Lipton have shown that, for any ε > 0,
l(1^2, ..., m^2) > m + m^{2/3−ε} when m is sufficiently large [SICOMP 9 (1980), 121-125].


39. The quantity l([n_1, n_2, ..., n_m]) is the minimum of arcs − vertices + m taken over all
directed graphs having m vertices s_j whose in-degree is zero and one vertex t whose
out-degree is zero, where there are exactly n_j oriented paths from s_j to t for 1 ≤ j ≤ m.
The quantity l(n_1, n_2, ..., n_m) is the minimum of arcs − vertices + 1 taken over all
directed graphs having one vertex s whose in-degree is zero and m vertices t_j whose
out-degree is zero, where there are exactly n_j oriented paths from s to t_j for 1 ≤ j ≤ m.
These problems are dual to each other, if we change the direction of all the arcs. [See
J. Algorithms 2 (1981), 13-21.]

Note: C. H. Papadimitriou has observed that this is a special case of a much more
general theorem. Let N = (n_ij) be an m × p matrix of nonnegative integers having
no row or column entirely zero. We can define l(N) to be the minimum number of
multiplications needed to compute the set of monomials {x_1^{n_1j} ... x_m^{n_mj} | 1 ≤ j ≤ p}.
Now l(N) is also the minimum of arcs − vertices + m taken over all directed graphs
having m vertices s_i whose in-degree is zero and p vertices t_j whose out-degree is zero,
where there are exactly n_ij oriented paths from s_i to t_j for each i and j. By duality
we have l(N) = l(N^T) + m − p. [Bulletin of the EATCS 13 (February 1981), 2-3.]

N. Pippenger has considerably extended the results of exercises 36 and 37. For
example, if L(m, p, n) is the maximum of l(N) taken over all m × p matrices N of
nonnegative integers n_ij ≤ n, he showed that L(m, p, n) = min(m, p) lg n + H/lg H +
O(m + p + H(log log H)^{1/2}(log H)^{−3/2}), where H = mp lg(n + 1). [See SICOMP 9
(1980), 230-250.]

40. By exercise 39, it suffices to show that l(m_1 n_1 + ··· + m_t n_t) ≤ l(m_1, ..., m_t) +
l([n_1, ..., n_t]). But this is clear, since we can first form {x^{m_1}, ..., x^{m_t}} and then
compute the monomial (x^{m_1})^{n_1} ... (x^{m_t})^{n_t}.

Note: One strong way to state Olivos’s theorem is that if a_0, ..., a_r and b_0, ..., b_s
are any addition chains, then l(Σ c_ij a_i b_j) ≤ r + s + Σ c_ij − 1 for any (r + 1) × (s + 1)
matrix of nonnegative integers c_ij.


41. [SICOMP 10 (1981), 638-646.] The stated formula can be proved whenever A ≥
9m^2. Since this is a polynomial in m, and since the problem of finding a minimum
vertex cover is NP-hard (see Section 7.9), the problem of computing l(n_1, ..., n_m) is
NP-complete. [It is unknown whether or not the problem of computing l(n) is
NP-complete. But it seems plausible that an optimum chain for, say, Σ_{k=0}^{m−1} n_{k+1} 2^{Ak^2}
would entail an optimum chain for {n_1, ..., n_m}, when A is sufficiently large.]


42. The condition fails at 128 (and in the dual 1, 2, ..., 16384, 16385, 16401, 32768,
... at 32768). Only two reduced digraphs of cost 27 exist; hence l^0(5784689) = 28.
Furthermore, Clift’s programs proved that l^0(n) = l(n) for all smaller values of n.


SECTION 4.6.4 
1. Set y ← x^2, then compute ((... (u_{2n+1} y + u_{2n−1})y + ···)y + u_1)x.

2. Replacing x in (2) by the polynomial x + x_0 leads to the following procedure:


G1. Do step G2 for k = n, n — 1, ..., 0 (in this order), and stop. 


G2. Set v_k ← u_k, and then set v_j ← v_j + x_0 v_{j+1} for j = k, k + 1, ..., n − 1.
(When k = n, this step simply sets v_n ← u_n.) ▮


The computations turn out to be identical to those in H1 and H2, but performed in a 
different order. (This process was Newton’s original motivation for using scheme (2).) 
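
Steps G1 and G2 translate almost line for line into Python. The sketch below takes the coefficient list [u_0, ..., u_n] and returns the coefficients [v_0, ..., v_n] of u(x + x_0); the function name and the list layout are assumptions made for illustration.

```python
def shift_polynomial(u, x0):
    """Return the coefficients of u(x + x0), lowest degree first."""
    n = len(u) - 1
    v = [0] * (n + 1)
    for k in range(n, -1, -1):          # G1: k = n, n-1, ..., 0
        v[k] = u[k]                     # G2: v_k <- u_k
        for j in range(k, n):           #     j = k, k+1, ..., n-1
            v[j] += x0 * v[j + 1]
    return v


def evaluate(coeffs, x):
    """Horner's rule, used only to check the shifted coefficients."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result
```

For u(x) = 3x^2 + 2x + 1 and x_0 = 1 the result is [6, 8, 3], that is, u(x + 1) = 3x^2 + 8x + 6.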


3. The coefficient of x^k is a polynomial in y that may be evaluated by Horner’s rule:
(... (u_{n,0} x + (u_{n−1,1} y + u_{n−1,0}))x + ···)x + ((... (u_{0,n} y + u_{0,n−1})y + ···)y + u_{0,0}). [For a
“homogeneous” polynomial, such as u_n x^n + u_{n−1} x^{n−1} y + ··· + u_1 x y^{n−1} + u_0 y^n, another
scheme is more efficient: If 0 < |x| ≤ |y|, first divide x by y, evaluate a polynomial in
x/y, then multiply by y^n.]


4. Rule (2) involves 4n or 3n real multiplications and 4n or 7n real additions; (3) is 
worse, it takes 4n + 2 or 4n + 1 multiplications, 4n + 2 or 4n + 5 additions. 


5. One multiplication to compute x^2; ⌊n/2⌋ multiplications and ⌊n/2⌋ additions to
evaluate the first line; ⌈n/2⌉ multiplications and ⌈n/2⌉ − 1 additions to evaluate the
second line; and one addition to add the two lines together. Total: n + 1 multiplications
and n additions.


6. J1. Compute and store the values x_0^2, x_0^3, ..., x_0^{⌈n/2⌉}.

J2. Set v_j ← u_j x_0^{j−⌊n/2⌋} for 0 ≤ j ≤ n.

J3. For k = 0, 1, ..., n − 1, set v_j ← v_j + v_{j+1} for j = n − 1, ..., k + 1, k.

J4. Set v_j ← v_j x_0^{⌊n/2⌋−j} for 0 ≤ j ≤ n. ▮

There are (n^2 + n)/2 additions, n + ⌈n/2⌉ − 1 multiplications, n divisions. Another
multiplication and division can be saved by treating v_n and v_0 as special cases. Reference:
SIGACT News 7,3 (Summer 1975), 32-34.


7. Let x_j = x_0 + jh, and consider (42) and (44). Set y_j ← u(x_j) for 0 ≤ j ≤ n. For
k = 1, 2, ..., n (in this order), set y_j ← y_j − y_{j−1} for j = n, n − 1, ..., k (in this
order). Now set β_j ← y_j for all j.

However, rounding errors will accumulate as explained in the text, even if the 
operations of (5) are done with perfect accuracy. A better way to do the initialization, 
when (5) is performed with fixed point arithmetic, is to choose β_0, ..., β_n so that

    ( binom(0,0)   binom(0,1)   ...  binom(0,n)  ) ( β_0 )   ( u(x_0)    )   ( ε_0 )
    ( binom(d,0)   binom(d,1)   ...  binom(d,n)  ) ( β_1 )   ( u(x_d)    )   ( ε_1 )
    (    ...          ...       ...     ...      ) ( ... ) = (   ...     ) + ( ... )
    ( binom(nd,0)  binom(nd,1)  ...  binom(nd,n) ) ( β_n )   ( u(x_{nd}) )   ( ε_n )

where |ε_0|, |ε_1|, ..., |ε_n| are as small as possible. [H. Hassler, Proc. 12th Spring Conf.
Computer Graphics (Bratislava: Comenius University, 1996), 55-66.]

8. See (43). 


9. [Combinatorial Mathematics (Buffalo: Math. Assoc. of America, 1963), 26-28.]
This formula can be regarded as an application of the principle of inclusion and
exclusion (Section 1.3.3), since the sum of the terms for n − ε_1 − ··· − ε_n = k is
the sum of all x_{1j_1} x_{2j_2} ... x_{nj_n} for which k values of the j_i do not appear. A direct
proof can be given by observing that the coefficient of x_{1j_1} ... x_{nj_n} is

Σ (−1)^{n−ε_1−···−ε_n} ε_{j_1} ... ε_{j_n};

if the j’s are distinct, this equals unity, but if j_1, ..., j_n ≠ k then it is zero, since the
terms for ε_k = 0 cancel the terms for ε_k = 1.

To evaluate the sum efficiently, we can start with ε_1 ← 1, ε_2 ← ··· ← ε_n ← 0,
and we can then proceed through all combinations of the ε’s in such a way that only
one ε changes from one term to the next. (See “Gray binary code” in Section 7.2.1.1.)
The first term costs n − 1 multiplications; the subsequent 2^n − 2 terms each involve n
additions, then n − 1 multiplications, then one more addition. Total: (2^n − 1)(n − 1)
multiplications, and (2^n − 2)(n + 1) additions. Only n + 1 temporary storage locations
are needed, one for the main partial sum and one for each factor of the current product.
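
The Gray-code evaluation described above can be sketched in Python. The function below keeps one running row sum per factor and flips a single ε at each step, so that each new term costs n additions and a product of n factors. The function names are assumptions; the brute-force version is included only as a cross-check, and `math.prod` requires Python 3.8 or later.

```python
from itertools import permutations
from math import prod


def permanent_gray(x):
    """Permanent of a square matrix by inclusion and exclusion in Gray-code order."""
    n = len(x)
    eps = [0] * n
    row_sums = [0] * n                       # row_sums[i] = sum of eps[j] * x[i][j]
    ones = 0                                 # number of eps[j] equal to 1
    total = 0
    for g in range(1, 1 << n):
        j = (g & -g).bit_length() - 1        # the one eps that changes at step g
        eps[j] ^= 1
        delta = 1 if eps[j] else -1
        ones += delta
        for i in range(n):
            row_sums[i] += delta * x[i][j]
        sign = -1 if (n - ones) % 2 else 1
        total += sign * prod(row_sums)
    return total


def permanent_brute(x):
    """Permanent straight from the definition, for checking small cases."""
    n = len(x)
    return sum(prod(x[i][p[i]] for i in range(n)) for p in permutations(range(n)))
```

The term with every ε equal to zero is skipped because its product vanishes.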


10. Σ_{1≤k<n} (k + 1) binom(n, k + 1) = n(2^{n−1} − 1) multiplications and Σ_{1≤k<n} k binom(n, k + 1) =
n2^{n−1} − 2^n + 1 additions. This is approximately half as many arithmetic operations as
the method of exercise 9, although it requires a more complicated program to control
the sequence. Approximately binom(n, ⌈n/2⌉) + binom(n, ⌈n/2⌉ − 1) temporary storage locations must be
used, and this grows exponentially large (on the order of 2^n/√n).

The method in this exercise is equivalent to the unusual matrix factorization of 
the permanent function given by Jurkat and Ryser in J. Algebra 3 (1966), 1-27. It may 
also be regarded as an application of (39) and (40), in an appropriate sense. 


11. Efficient methods are known for computing an approximate value, if the matrix
is sufficiently dense; see A. Sinclair, Algorithms for Random Generation and Counting
(Boston: Birkhäuser, 1993). But this problem asks for the exact value. There may be
a way to evaluate the permanent with O(c^n) operations for some c < 2.


12. Here is a brief summary of progress on this famous research problem: J. Hopcroft 
and L. R. Kerr proved, among other things, that seven multiplications are necessary in 
2 x 2 matrix multiplication modulo 2 [SIAM J. Appl. Math. 20 (1971), 30-36]. R. L. 
Probert showed that all 7-multiplication schemes, in which each multiplication takes a 
linear combination of elements from one matrix and multiplies by a linear combination 
of elements from the other, must have at least 15 additions [SICOMP 5 (1976), 187- 
203]. The tensor rank of 2 x 2 matrix multiplication is 7 over every field [V. Y. Pan, 
J. Algorithms 2 (1981), 301-310]; the rank of T(2,3,2), the tensor for the product 
of a 2 x 3 matrix by a 3 x 2 matrix, is 11 [V. B. Alekseyev, J. Algorithms 6 (1985), 
71-85]. For n × n matrix multiplication, the best upper bound known when n = 3
is due to J. D. Laderman [Bull. Amer. Math. Soc. 82 (1976), 126-128], who showed
that 23 noncommutative multiplications suffice. His construction has been generalized
by Ondrej Sýkora, who exhibited a method requiring n^3 − (n − 1)^2 noncommutative
multiplications and n^3 − n^2 + 11(n − 1)^2 additions; this result also reduces to (36) when
n = 2 [Lecture Notes in Comp. Sci. 53 (1977), 504-512]. For n = 5, the current record
is 100 noncommutative multiplications [O. M. Makarov, USSR Comp. Math. and Math.
Phys. 27,1 (1987), 205-207]. The best lower bound known so far is due to Markus
Bläser, who showed that 2n^2 + n − 3 nonscalar multiplications are necessary for n ≥ 2,
and mn + ns + m − n + s − 3 in the m × n × s case for n ≥ 2 and s ≥ 2 [Computational
Complexity 8 (1999), 203-226]. If all calculations must be done without division,
slightly better lower bounds were obtained by N. H. Bshouty [SICOMP 18 (1989),
759-765], who proved that m × n by n × s matrix multiplication mod 2 requires at least
Σ_{k=0}^{j−1} ⌊ms/2^k⌋ + ½(n + (n mod j))(n − (n mod j) − j) + n mod j multiplications when
n ≥ s ≥ j ≥ 1; setting m = n = s and j ≈ lg n gives 2.5n^2 − ½n lg n + O(n).
The best upper bounds known for large n are discussed in the text, following (36).


13. By summing geometric series, we find that F(t_1, ..., t_n) equals

Σ_{0≤s_1<m_1} ... Σ_{0≤s_n<m_n} exp(−2πi(s_1 t_1/m_1 + ··· + s_n t_n/m_n)) f(s_1, ..., s_n)/m_1 ... m_n.

The inverse transform times m_1 ... m_n can be found by doing a regular transform and
interchanging t_j with m_j − t_j when t_j ≠ 0; see exercise 4.3.3-9.

[If we regard F(t_1, ..., t_n) as the coefficient of x_1^{t_1} ... x_n^{t_n} in a multivariate
polynomial, the discrete Fourier transform amounts to evaluation of this polynomial at roots
of unity, and the inverse transform amounts to finding the interpolating polynomial.]


14. Let m_1 = ··· = m_n = 2, F(t_1, t_2, ..., t_n) = F(2^{n−1} t_n + ··· + 2t_2 + t_1), and
f(s_1, s_2, ..., s_n) = f(2^{n−1} s_1 + 2^{n−2} s_2 + ··· + s_n); note the reversal between t’s and
s’s. Also let g_k(s_k, ..., s_n, t_k) be ω raised to the 2^{k−1} t_k (s_n + 2s_{n−1} + ··· + 2^{n−k} s_k)
power. Replace f_k(s_{n−k+1}, ..., s_n, t_1, ..., t_{n−k}) by f_k(t_1, ..., t_{n−k}, s_{n−k+1}, ..., s_n) in
(40) if you prefer to work in situ.

At each iteration we essentially take 2^{n−1} pairs of complex numbers (α, β) and
replace them by (α + ζβ, α − ζβ), where ζ is a suitable power of ω, hence ζ = cos θ + i sin θ
for some θ. If we take advantage of simplifications when ζ = ±1 or ±i, the total work
comes to ((n − 3) · 2^{n−1} + 2) complex multiplications and n · 2^n complex additions; the
techniques of exercise 41 can be used to reduce the real multiplications and additions
used to implement these complex operations.
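
The pairwise replacement (α, β) → (α + ζβ, α − ζβ) is the butterfly of a radix-2 fast Fourier transform. The recursive Python sketch below is one standard arrangement of those butterflies, assuming ω = exp(2πi/2^n) and f(s) = Σ F(t) ω^{st}; it is not a transcription of the in-situ scheme (40). The optional counter tallies the multiplications by ζ other than ±1 and ±i, which can be compared with the count (n − 3) · 2^{n−1} + 2 stated above.

```python
import cmath


def fft(values, counter=None):
    """Radix-2 transform f(s) = sum of values[t] * w**(s*t), len(values) a power of 2."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], counter)
    odd = fft(values[1::2], counter)
    out = [0j] * n
    for s in range(n // 2):
        zeta = cmath.exp(2j * cmath.pi * s / n)
        if counter is not None and (4 * s) % n != 0:
            counter[0] += 1                  # zeta is not one of 1, i, -1, -i
        # Butterfly: (alpha, beta) -> (alpha + zeta*beta, alpha - zeta*beta).
        out[s] = even[s] + zeta * odd[s]
        out[s + n // 2] = even[s] - zeta * odd[s]
    return out


def dft_direct(values):
    """Quadratic-time transform from the definition, for checking."""
    n = len(values)
    return [sum(values[t] * cmath.exp(2j * cmath.pi * s * t / n) for t in range(n))
            for s in range(n)]
```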

The number of complex multiplications can be reduced about 25 percent without
changing the number of additions by combining passes k and k + 1 for k = 1, 3, ...;
this means that 2^{n−2} quadruples (α, β, γ, δ) are being replaced by

(α + ζβ + ζ^2γ + ζ^3δ, α + iζβ − ζ^2γ − iζ^3δ, α − ζβ + ζ^2γ − ζ^3δ, α − iζβ − ζ^2γ + iζ^3δ).

The total number of complex multiplications when n is even is thereby reduced to
(3n − 2)2^{n−3} − 5⌊2^{n−1}/3⌋.

These calculations assume that the given numbers F(t) are complex. If the F(t)
are real, then f(s) is the complex conjugate of f(2^n − s), so we can avoid the redundancy
by computing only the 2^n independent real numbers f(0), ℜf(1), ..., ℜf(2^{n−1} − 1),
f(2^{n−1}), ℑf(1), ..., ℑf(2^{n−1} − 1). The entire calculation in this case can be done
by working with 2^n real values, using the fact that f_k(s_{n−k+1}, ..., s_n, t_1, ..., t_{n−k})
will be the complex conjugate of f_k(s′_{n−k+1}, ..., s′_n, t_1, ..., t_{n−k}) when (s_1 ... s_n)_2 +
(s′_1 ... s′_n)_2 ≡ 0 (modulo 2^n). About half as many multiplications and additions are
needed as in the complex case.

[The fast Fourier transform algorithm was discovered by C. F. Gauss in 1805 and 
independently rediscovered many times since, most notably by J. W. Cooley and J. 
W. Tukey, Math. Comp. 19 (1965), 297-301. Its interesting history has been traced 
by J. W. Cooley, P. A. W. Lewis, and P. D. Welch, Proc. IEEE 55 (1967), 1675-1677; 
M. T. Heideman, D. H. Johnson, and C. S. Burrus, IEEE ASSP Magazine 1, 4 (October 
1984), 14-21. Details concerning its use have been discussed by hundreds of authors, 
admirably summarized by Charles Van Loan, Computational Frameworks for the Fast 
Fourier Transform (Philadelphia: SIAM, 1992). For a survey of fast Fourier transforms 
on finite groups, see M. Clausen and U. Baum, Fast Fourier Transforms (Mannheim: 
Bibliographisches Institut Wissenschaftsverlag, 1993).] 


15. (a) The hint follows by integration and induction. Let $f^{(n)}(\theta)$ take on all values
between $A$ and $B$ inclusive, as $\theta$ varies from $\min(x_0,\ldots,x_n)$ to $\max(x_0,\ldots,x_n)$.
Replacing $f^{(n)}$ by each of these bounds, in the stated integral, yields
$A/n!\le f(x_0,\ldots,x_n)\le B/n!$. (b) It suffices to prove this for $j=n$. Let $f$ be Newton's
interpolation polynomial, then $f^{(n)}$ is the constant $n!\,\alpha_n$. [See The Mathematical
Papers of Isaac Newton, edited by D. T. Whiteside, 4 (1971), 36-51, 70-73.]

16. Carry out the multiplications and additions of (43) as operations on polynomials.
(The special case $x_0=x_1=\cdots=x_n$ is considered in exercise 2. We have used this
method in step T8 of Algorithm 4.3.3T.)


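17. For example, when $n=5$ we have

$$u_{[5]}(x)=\frac{\dfrac{y_0}{x-x_0}-\dfrac{5y_1}{x-x_1}+\dfrac{10y_2}{x-x_2}-\dfrac{10y_3}{x-x_3}+\dfrac{5y_4}{x-x_4}-\dfrac{y_5}{x-x_5}}{\dfrac{1}{x-x_0}-\dfrac{5}{x-x_1}+\dfrac{10}{x-x_2}-\dfrac{10}{x-x_3}+\dfrac{5}{x-x_4}-\dfrac{1}{x-x_5}},$$

independent of the value of $h$.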


18. $\alpha_0=\frac12(u_3/u_4+1)$, $\beta=u_2/u_4-\alpha_0(\alpha_0-1)$, $\alpha_1=\alpha_0\beta-u_1/u_4$, $\alpha_2=\beta-2\alpha_1$,
$\alpha_3=u_0/u_4-\alpha_1(\alpha_1+\alpha_2)$, $\alpha_4=u_4$.
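
These formulas can be checked by direct evaluation. The sketch below is an illustration only; it assumes that the scheme in question is $y=(x+\alpha_0)x+\alpha_1$, $u(x)=((y-x+\alpha_2)y+\alpha_3)\alpha_4$, which is the form that the formulas above invert, and the function names are hypothetical. Exact rational arithmetic avoids any rounding questions.

```python
from fractions import Fraction

def alphas_for_modified_scheme(u):
    """u = [u0, u1, u2, u3, u4] with u4 != 0; returns [a0, a1, a2, a3, a4]."""
    u0, u1, u2, u3, u4 = (Fraction(c) for c in u)
    a0 = (u3 / u4 + 1) / 2
    beta = u2 / u4 - a0 * (a0 - 1)
    a1 = a0 * beta - u1 / u4
    a2 = beta - 2 * a1
    a3 = u0 / u4 - a1 * (a1 + a2)
    return [a0, a1, a2, a3, u4]

def evaluate(a, x):
    """Assumed scheme: y = (x + a0)x + a1;  u(x) = ((y - x + a2)y + a3)a4."""
    y = (x + a[0]) * x + a[1]
    return ((y - x + a[2]) * y + a[3]) * a[4]
```

Evaluation then needs three multiplications and five additions per point, once the $\alpha$'s are known.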


19. Since $\alpha_5$ is the leading coefficient, we may assume without loss of generality that
$u(x)$ is monic (namely that $u_5=1$). Then $\alpha_0$ is a root of the equation
$40z^3-24u_4z^2+(4u_4^2+2u_3)z+(u_2-u_3u_4)=0$; this equation always has at least one real
root, and it may have three. Once $\alpha_0$ is determined, we have $\alpha_3=u_4-4\alpha_0$,
$\alpha_1=u_3-4\alpha_0\alpha_3-6\alpha_0^2$, $\alpha_2=u_1-\alpha_0(\alpha_0\alpha_1+4\alpha_0^2\alpha_3+2\alpha_1\alpha_3+\alpha_0^3)$,
$\alpha_4=u_0-\alpha_3(\alpha_0^4+\alpha_1\alpha_0^2+\alpha_2)$.

For the given polynomial we are to solve the cubic equation $40z^3-120z^2+80z=0$;
this leads to three solutions $(\alpha_0,\alpha_1,\alpha_2,\alpha_3,\alpha_4,\alpha_5)=(0,-10,13,5,-5,1)$,
$(1,-20,68,1,11,1)$, $(2,-10,13,-3,27,1)$.
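
The three solutions can be reproduced by a short computation. This is a sketch under stated assumptions: the scheme is taken to be $y=(x+\alpha_0)^2$, $u(x)=(((y+\alpha_1)y+\alpha_2)(x+\alpha_3)+\alpha_4)\alpha_5$, which is the form the formulas above invert, and the coefficient list $(u_0,\ldots,u_4)=(60,13,-50,-10,5)$ is the one implied by the first solution listed (the case $\alpha_0=0$). The function names are hypothetical.

```python
from fractions import Fraction

def alphas_from_root(u, a0):
    """u = [u0,...,u4]: lower coefficients of a monic quintic; a0: a root of the cubic."""
    u0, u1, u2, u3, u4 = (Fraction(c) for c in u)
    a0 = Fraction(a0)
    # a0 must satisfy 40 z^3 - 24 u4 z^2 + (4 u4^2 + 2 u3) z + (u2 - u3 u4) = 0.
    assert 40 * a0**3 - 24 * u4 * a0**2 + (4 * u4**2 + 2 * u3) * a0 + (u2 - u3 * u4) == 0
    a3 = u4 - 4 * a0
    a1 = u3 - 4 * a0 * a3 - 6 * a0**2
    a2 = u1 - a0 * (a0 * a1 + 4 * a0**2 * a3 + 2 * a1 * a3 + a0**3)
    a4 = u0 - a3 * (a0**4 + a1 * a0**2 + a2)
    return [a0, a1, a2, a3, a4, Fraction(1)]

def evaluate(a, x):
    """Assumed scheme: y = (x + a0)^2;  u(x) = (((y + a1)y + a2)(x + a3) + a4)a5."""
    y = (x + a[0]) ** 2
    return (((y + a[1]) * y + a[2]) * (x + a[3]) + a[4]) * a[5]
```

Each of the three parameter sets evaluates the same quintic, so any real root of the cubic may be used.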


20.  LDA  X            STA  TEMP2        FADD =α1=         FMUL TEMP1
     FADD =α3=         FMUL TEMP2        FMUL TEMP2        FADD =α4=
     STA  TEMP1        STA  TEMP2        FADD =α2=         FMUL =α5=
     FADD =α0−α3=
21. $z=(x+1)x-2$, $w=(x+5)z+9$, $u(x)=(w+z-8)w-8$; or $z=(x+9)x+26$,
$w=(x-3)z+73$, $u(x)=(w+z-24)w-12$.
22. $\alpha_6=1$, $\alpha_0=-1$, $\alpha_1=1$, $\beta_1=-2$, $\beta_2=-2$, $\beta_3=-2$, $\beta_4=1$, $\alpha_3=-4$, $\alpha_2=0$,
$\alpha_4=4$, $\alpha_5=-2$. We form $z=(x-1)x+1$, $w=z+x$, and $u(x)=((z-x-4)w+4)z-2$.
Here one of the seven additions can be saved if we compute $w=x^2+1$, $z=w-x$.
23. (a) We may use induction on $n$; the result is trivial if $n<2$. If $f(0)=0$, then
the result is true for the polynomial $f(z)/z$, so it holds for $f(z)$. If $f(iy)=0$ for some
real $y\ne0$, then $g(\pm iy)=h(\pm iy)=0$; since the result is true for $f(z)/(z^2+y^2)$, it
holds also for $f(z)$. Therefore we may assume that $f(z)$ has no roots whose real part
is zero. Now the net number of times the given path circles the origin is the number of
roots of $f(z)$ inside the region, which is at most 1. When $R$ is large, the path $f(Re^{it})$
for $\pi/2\le t\le3\pi/2$ will circle the origin clockwise approximately $n/2$ times; so the
path $f(it)$ for $-R\le t\le R$ must go counterclockwise around the origin at least $n/2-1$
times. For $n$ even, this implies that $f(it)$ crosses the imaginary axis at least $n-2$ times,
and the real axis at least $n-3$ times; for $n$ odd, $f(it)$ crosses the real axis at least
$n-2$ times and the imaginary axis at least $n-3$ times. These are roots respectively
of $g(it)=0$, $h(it)=0$.

(b) If not, $g$ or $h$ would have a root of the form $a+bi$ with $a\ne0$ and $b\ne0$. But
this would imply the existence of at least three other such roots, namely $a-bi$ and
$-a\pm bi$, while $g(z)$ and $h(z)$ have at most $n$ roots.


24. The roots of $u$ are $-7$, $-3\pm i$, $-2\pm i$, and $-1$; permissible values of $c$ are 2 and 4
(but not 3, since $c=3$ makes the sum of the roots equal to zero). Case 1: $c=2$. Then
$p(x)=(x+5)(x^2+2x+2)(x^2+1)(x-1)=x^6+6x^5+6x^4+4x^3-5x^2-2x-10$;
$q(x)=6x^2+4x-2=6(x+1)(x-\frac13)$. Let $\alpha_2=-1$, $\alpha_1=\frac13$;
$p_1(x)=x^4+6x^3+5x^2-2x-10=(x^2+6x+\frac{16}{3})(x^2-\frac13)-\frac{74}{9}$; $\alpha_0=6$, $\beta_0=\frac{16}{3}$, $\beta_1=-\frac{74}{9}$.
Case 2: $c=4$. A similar analysis gives $\alpha_2=9$, $\alpha_1=-3$, $\alpha_0=-6$, $\beta_0=12$, $\beta_1=-26$.


25. 81 = a2, B2 = 201, 83 = a7, Bs = a6, Bs = Be = 0, Br = a1, Bg = 0, Bo = 2a — as. 
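26. (a) $\lambda_1=\alpha_1\times\lambda_0$, $\lambda_2=\alpha_2+\lambda_1$, $\lambda_3=\lambda_2\times\lambda_0$, $\lambda_4=\alpha_3+\lambda_3$, $\lambda_5=\lambda_4\times\lambda_0$,
$\lambda_6=\alpha_4+\lambda_5$. (b) $\kappa_1=1+\beta_1x$, $\kappa_2=1+\beta_2\kappa_1x$, $\kappa_3=1+\beta_3\kappa_2x$, $u(x)=\beta_4\kappa_3=
\beta_1\beta_2\beta_3\beta_4x^3+\beta_2\beta_3\beta_4x^2+\beta_3\beta_4x+\beta_4$. (c) If any coefficient is zero, the coefficient of $x^3$
must also be zero in (b), while (a) yields an arbitrary polynomial $\alpha_1x^3+\alpha_2x^2+\alpha_3x+\alpha_4$
of degree $\le3$.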
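
The difference between the two chains in (a) and (b) can be seen experimentally. The Python sketch below is illustrative only (the function names are invented here): `chain_a` follows the six steps of (a), `chain_b` follows (b), and `coefficients_b` lists the coefficients that (b) produces, so that the claim in (c) can be tested over a small grid of integer parameters.

```python
def chain_a(alpha, x):
    """Part (a): lambda_1 .. lambda_6 evaluate a1 x^3 + a2 x^2 + a3 x + a4."""
    a1, a2, a3, a4 = alpha
    lam = a1 * x            # lambda_1 = alpha_1 * lambda_0
    lam = a2 + lam          # lambda_2
    lam = lam * x           # lambda_3
    lam = a3 + lam          # lambda_4
    lam = lam * x           # lambda_5
    return a4 + lam         # lambda_6

def chain_b(beta, x):
    """Part (b): kappa_1, kappa_2, kappa_3, then u(x) = beta_4 * kappa_3."""
    b1, b2, b3, b4 = beta
    k1 = 1 + b1 * x
    k2 = 1 + b2 * k1 * x
    k3 = 1 + b3 * k2 * x
    return b4 * k3

def coefficients_b(beta):
    """Coefficients of x^3, x^2, x, 1 in the polynomial computed by chain_b."""
    b1, b2, b3, b4 = beta
    return [b1 * b2 * b3 * b4, b2 * b3 * b4, b3 * b4, b4]
```

For instance, $2x^3+5$ is computed by `chain_a((2, 0, 0, 5), x)`, but no choice of the $\beta$'s yields it in (b), because a zero coefficient of $x$ forces $\beta_3\beta_4=0$.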


27. Otherwise there would be a nonzero polynomial $f(q_n,\ldots,q_1,q_0)$ with integer
coefficients such that $q_n\cdot f(q_n,\ldots,q_1,q_0)=0$ for all sets $(q_n,\ldots,q_0)$ of real numbers. This
cannot happen, since it is easy to prove by induction on $n$ that a nonzero polynomial
always takes on some nonzero value. (See exercise 4.6.1-16. However, this result is
false for finite fields in place of the real numbers.)


28. The indeterminate quantities $\alpha_1,\ldots,\alpha_s$ form an algebraic basis for the polynomial
domain $Q[\alpha_1,\ldots,\alpha_s]$, where $Q$ is the field of rational numbers. Since $s+1$ is greater
than the number of elements in a basis, the polynomials $f_j(\alpha_1,\ldots,\alpha_s)$ are algebraically
dependent; this means that there is a nonzero polynomial $g$ with rational coefficients
such that $g(f_0(\alpha_1,\ldots,\alpha_s),\ldots,f_s(\alpha_1,\ldots,\alpha_s))$ is identically zero.


29. Given $j_0,\ldots,j_t\in\{0,1,\ldots,n\}$, there are nonzero polynomials with integer
coefficients such that $g_j(q_{j_0},\ldots,q_{j_t})=0$ for all $(q_n,\ldots,q_0)$ in $R_j$, $1\le j\le m$. The product
$g_1g_2\ldots g_m$ is therefore zero for all $(q_n,\ldots,q_0)$ in $R_1\cup\cdots\cup R_m$.


30. Starting with the construction in Theorem M, we will prove that $m_p+(1-\delta_{0m_c})$ of
the $\beta$'s may effectively be eliminated: If $\mu_i$ corresponds to a parameter multiplication,
we have $\mu_i=\beta_{2i-1}\times(T_{2i}+\beta_{2i})$; add $c\beta_{2i-1}\beta_{2i}$ to each $\beta_j$ for which $c\mu_i$ occurs in $T_j$,
and replace $\beta_{2i}$ by zero. This removes one parameter for each parameter multiplication.
If $\mu_i$ is the first chain multiplication, then $\mu_i=(\gamma_1x+\theta_1+\beta_{2i-1})\times(\gamma_2x+\theta_2+\beta_{2i})$,
where $\gamma_1$, $\gamma_2$, $\theta_1$, $\theta_2$ are polynomials in $\beta_1,\ldots,\beta_{2i-2}$ with integer coefficients. Here
$\theta_1$ and $\theta_2$ can be "absorbed" into $\beta_{2i-1}$ and $\beta_{2i}$, respectively, so we may assume that
$\theta_1=\theta_2=0$. Now add $c\beta_{2i-1}\beta_{2i}$ to each $\beta_j$ for which $c\mu_i$ occurs in $T_j$; add $\beta_{2i-1}\gamma_2/\gamma_1$
to $\beta_{2i}$; and set $\beta_{2i-1}$ to zero. The result set is unchanged by this elimination of $\beta_{2i-1}$,
except for the values of $\alpha_1,\ldots,\alpha_s$ such that $\gamma_1$ is zero. [This proof is essentially due to
V. Y. Pan, Uspekhi Mat. Nauk 21,1 (January-February 1966), 103-134.] The latter
case can be handled as in the proof of Theorem A, since the polynomials with $\gamma_1=0$
can be evaluated by eliminating $\beta_{2i}$ (as in the first construction, where $\mu_i$ corresponds
to a parameter multiplication).


31. Otherwise we could add one parameter multiplication as a final step, and violate 
Theorem C. (The exercise is an improvement over Theorem A, in this special case, 
since there are only n degrees of freedom in the coefficients of a monic polynomial of 
degree n.) 


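32. $\lambda_1=\lambda_0\times\lambda_0$, $\lambda_2=\alpha_1\times\lambda_1$, $\lambda_3=\alpha_2+\lambda_2$, $\lambda_4=\lambda_3\times\lambda_1$, $\lambda_5=\alpha_3+\lambda_4$. We
need at least three multiplications to compute $u_4x^4$ (see Section 4.6.3), and at least
two additions by Theorem A.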


33. We must have $n+1\le2m_c+m_p+\delta_{0m_c}$, and $m_c+m_p=(n+1)/2$; so there are no
parameter multiplications. Now the first $\lambda_i$ whose leading coefficient (as a polynomial
in $x$) is not an integer must be obtained by a chain addition; and there must be at least
$n+1$ parameters, so there are at least $n+1$ parameter additions.


34. Transform the given chain step by step, and also define the "content" $c_i$ of $\lambda_i$, as
follows: (Intuitively, $c_i$ is the leading coefficient of $\lambda_i$.) Define $c_0=1$. (a) If the step
has the form $\lambda_i=\alpha_j+\lambda_k$, replace it by $\lambda_i=\beta_j+\lambda_k$, where $\beta_j=\alpha_j/c_k$; and define
$c_i=c_k$. (b) If the step has the form $\lambda_i=\alpha_j-\lambda_k$, replace it by $\lambda_i=\beta_j+\lambda_k$, where
$\beta_j=-\alpha_j/c_k$; and define $c_i=-c_k$. (c) If the step has the form $\lambda_i=\alpha_j\times\lambda_k$, replace
it by $\lambda_i=\lambda_k$ (the step will be deleted later); and define $c_i=\alpha_jc_k$. (d) If the step has
the form $\lambda_i=\lambda_j\times\lambda_k$, leave it unchanged; and define $c_i=c_jc_k$.

After this process is finished, delete all steps of the form $\lambda_i=\lambda_k$, replacing $\lambda_i$ by
$\lambda_k$ in each future step that uses $\lambda_i$. Then add a final step $\lambda_{r+1}=\beta\times\lambda_r$, where $\beta=c_r$.
This is the desired scheme, since it is easy to verify that the new $\lambda_i$ are just the old
ones divided by the factor $c_i$. The $\beta$'s are given functions of the $\alpha$'s; division by zero
is no problem, because if any $c_k=0$ we must have $c_r=0$ (hence the coefficient of $x^n$
is zero), or else $\lambda_k$ never contributes to the final result.


35. Since there are at least five parameter steps, the result is trivial unless there is at
least one parameter multiplication; considering the ways in which three multiplications
can form $u_4x^4$, we see that there must be one parameter multiplication and two chain
multiplications. Therefore the four addition-subtractions must each be parameter steps,
and exercise 34 applies. We can now assume that only additions are used, and that
we have a chain to compute a general monic fourth-degree polynomial with two chain
multiplications and four parameter additions. The only possible scheme of this type
that calculates a fourth-degree polynomial has the form

    $\lambda_1=\alpha_1+\lambda_0$
    $\lambda_2=\alpha_2+\lambda_0$
    $\lambda_3=\lambda_1\times\lambda_2$
    $\lambda_4=\alpha_3+\lambda_3$
    $\lambda_5=\alpha_4+\lambda_3$
    $\lambda_6=\lambda_4\times\lambda_5$
    $\lambda_7=\alpha_5+\lambda_6$.

Actually this chain has one addition too many, but any correct scheme can be put into
this form if we restrict some of the $\alpha$'s to be functions of the others. Now $\lambda_7$ has the
form $(x^2+Ax+B)(x^2+Ax+C)+D=x^4+2Ax^3+(E+A^2)x^2+EAx+F$, where
$A=\alpha_1+\alpha_2$, $B=\alpha_1\alpha_2+\alpha_3$, $C=\alpha_1\alpha_2+\alpha_4$, $D=\alpha_5$, $E=B+C$, $F=BC+D$;
and since this involves only three independent parameters it cannot represent a general
monic fourth-degree polynomial.


36. As in the solution to exercise 35, we may assume that the chain computes a
general monic polynomial of degree six, using only three chain multiplications and six
parameter additions. The computation must take one of two general forms

    $\lambda_1=\alpha_1+\lambda_0$              $\lambda_1=\alpha_1+\lambda_0$
    $\lambda_2=\alpha_2+\lambda_0$              $\lambda_2=\alpha_2+\lambda_0$
    $\lambda_3=\lambda_1\times\lambda_2$         $\lambda_3=\lambda_1\times\lambda_2$
    $\lambda_4=\alpha_3+\lambda_0$              $\lambda_4=\alpha_3+\lambda_3$
    $\lambda_5=\alpha_4+\lambda_3$              $\lambda_5=\alpha_4+\lambda_3$
    $\lambda_6=\lambda_4\times\lambda_5$         $\lambda_6=\lambda_4\times\lambda_5$
    $\lambda_7=\alpha_5+\lambda_6$              $\lambda_7=\alpha_5+\lambda_3$
    $\lambda_8=\alpha_6+\lambda_6$              $\lambda_8=\alpha_6+\lambda_6$
    $\lambda_9=\lambda_7\times\lambda_8$         $\lambda_9=\lambda_7\times\lambda_8$
    $\lambda_{10}=\alpha_7+\lambda_9$           $\lambda_{10}=\alpha_7+\lambda_9$

where, as in exercise 35, an extra addition has been inserted to cover a more general
case. Neither of these schemes can calculate a general sixth-degree monic polynomial,
since the first case is a polynomial of the form

$$(x^3+Ax^2+Bx+C)(x^3+Ax^2+Bx+D)+E,$$

and the second case is a polynomial of the form

$$(x^4+2Ax^3+(E+A^2)x^2+EAx+F)(x^2+Ax+G)+H;$$

both of these involve only five independent parameters.


37. Let $p_0(x)=u_nx^n+u_{n-1}x^{n-1}+\cdots+u_0$ and $q_0(x)=x^n+v_{n-1}x^{n-1}+\cdots+v_0$. For
$1\le j\le n$, divide $p_{j-1}(x)$ by the monic polynomial $q_{j-1}(x)$, obtaining
$p_{j-1}(x)=\alpha_jq_{j-1}(x)+\beta_jq_j(x)$. Assume that a monic polynomial $q_j(x)$ of degree $n-j$
exists satisfying this relation; this will be true for almost all rational functions. Let
$p_j(x)=q_{j-1}(x)-xq_j(x)$. These definitions imply that $\deg(p_n)<1$, so we may let
$\alpha_{n+1}=p_n(x)$.

For the given rational function we have

    j    $\alpha_j$   $\beta_j$    $q_j(x)$           $p_j(x)$
    0                $x^2+8x+19$      $x^2+10x+29$
    1     1     2    $x+5$            $3x+19$
    2     3     4    $1$              $5$

so $u(x)/v(x)=p_0(x)/q_0(x)=1+2/(x+3+4/(x+5))$.
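
The table can be regenerated by carrying out the divisions mechanically. The following Python sketch is an illustration, not the book's program; the names `cf_coeffs` and `cf_eval` and the highest-degree-first coefficient lists are conventions assumed here. It raises an error in the exceptional cases where some $\beta_j$ would be zero.

```python
from fractions import Fraction

def cf_coeffs(u, v):
    """u, v: coefficients, highest degree first, both of length n+1, v monic.
    Returns ([alpha_1..alpha_{n+1}], [beta_1..beta_n])."""
    p = [Fraction(c) for c in u]
    q = [Fraction(c) for c in v]
    n = len(q) - 1
    alphas, betas = [], []
    for _ in range(n):
        a = p[0]                                        # alpha_j
        r = [pc - a * qc for pc, qc in zip(p, q)][1:]   # p_{j-1} - alpha_j q_{j-1}
        b = r[0]                                        # beta_j
        if b == 0:
            raise ValueError("exceptional rational function: no monic q_j exists")
        q_next = [c / b for c in r]                     # monic q_j
        # p_j = q_{j-1} - x q_j; the leading terms cancel.
        p = [qc - xc for qc, xc in zip(q, q_next + [Fraction(0)])][1:]
        q = q_next
        alphas.append(a)
        betas.append(b)
    alphas.append(p[0])                                 # alpha_{n+1} = p_n, a constant
    return alphas, betas

def cf_eval(alphas, betas, x):
    """alpha_1 + beta_1/(x + alpha_2 + beta_2/(x + ... + alpha_{n+1}))."""
    t = alphas[-1]
    for a, b in zip(reversed(alphas[:-1]), reversed(betas)):
        t = a + b / (x + t)
    return t
```

With $u(x)=x^2+10x+29$ and $v(x)=x^2+8x+19$, `cf_coeffs` returns the $\alpha$'s $1,3,5$ and the $\beta$'s $2,4$, matching the table; `cf_eval` then uses $n$ divisions and $2n$ additions.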

Notes: A general rational function of the stated form has 2n + 1 “degrees of 
freedom,” in the sense that it can be shown to have 2n + 1 essentially independent 
parameters. If we generalize polynomial chains to quolynomial chains, which allow 
division operations as well as addition, subtraction, and multiplication (see exercise 71), 
we can obtain the following results with slight modifications to the proofs of Theorems 
A and M: A quolynomial chain with q addition-subtraction steps has at most q + 1 
degrees of freedom. A quolynomial chain with m multiplication-division steps has at 
most 2m + 1 degrees of freedom. Therefore a quolynomial chain that computes almost 
all rational functions of the stated form must have at least 2n addition-subtractions, 
and n multiplication-divisions; the method in this exercise is optimal. 


38. The theorem is certainly true if $n=0$. Assume that $n$ is positive, and that a
polynomial chain computing $P(x;u_0,\ldots,u_n)$ is given, where each of the parameters $\alpha_j$
has been replaced by a real number. Let $\lambda_i=\lambda_j\times\lambda_k$ be the first chain multiplication
step that involves one of $u_0,\ldots,u_n$; such a step must exist because of the rank of $A$.
Without loss of generality, we may assume that $\lambda_j$ involves $u_n$; thus, $\lambda_j$ has the form
$h_0u_0+\cdots+h_nu_n+f(x)$, where $h_0,\ldots,h_n$ are real, $h_n\ne0$, and $f(x)$ is a polynomial
with real coefficients. (The $h$'s and the coefficients of $f(x)$ are derived from the values
assigned to the $\alpha$'s.)

Now change step $i$ to $\lambda_i=\alpha\times\lambda_k$, where $\alpha$ is an arbitrary real number. (We could
take $\alpha=0$; general $\alpha$ is used here merely to show that there is a certain amount of
flexibility available in the proof.) Add further steps to calculate

$$\lambda=(\alpha-f(x)-h_0u_0-\cdots-h_{n-1}u_{n-1})/h_n;$$

these new steps involve only additions and parameter multiplications (by suitable
new parameters). Finally, replace $\lambda_{-n-1}=u_n$ everywhere in the chain by this new
element $\lambda$. The result is a chain that calculates

$$Q(x;u_0,\ldots,u_{n-1})=P\bigl(x;u_0,\ldots,u_{n-1},(\alpha-f(x)-h_0u_0-\cdots-h_{n-1}u_{n-1})/h_n\bigr);$$

and this chain has one less chain multiplication. The proof will be complete if we
can show that $Q$ satisfies the hypotheses. The quantity $(\alpha-f(x))/h_n$ leads to a
possibly increased value of $m$, and a new vector $B'$. If the columns of $A$ are $A_0$, $A_1$,
..., $A_n$ (these vectors being linearly independent over the reals), the new matrix $A'$
corresponding to $Q$ has the column vectors

$$A_0-(h_0/h_n)A_n,\ \ \ldots,\ \ A_{n-1}-(h_{n-1}/h_n)A_n,$$

plus perhaps a few rows of zeros to account for an increased value of $m$, and these
columns are clearly also linearly independent. By induction, the chain that computes $Q$
has at least $n-1$ chain multiplications, so the original chain has at least $n$.

[Pan showed also that the use of division would give no improvement; see Problemy 
Kibernetiki 7 (1962), 21-30. Generalizations to the computation of several polynomials 
in several variables, with and without various kinds of preconditioning, have been given 
by S. Winograd, Comm. Pure and Applied Math. 23 (1970), 165-179.] 

39. By induction on $m$. Let $w_m(x)=x^{2m}+u_{2m-1}x^{2m-1}+\cdots+u_0$, $w_{m-1}(x)=
x^{2m-2}+v_{2m-3}x^{2m-3}+\cdots+v_0$, $a=\alpha_1+\gamma_m$, $b=\alpha_m$, and let

$$f(r)=\sum_{i,j\ge0}(-1)^{i+j}\binom{i+j}{j}u_{r+i+2j}\,a^ib^j$$

(with $u_{2m}=1$ and $u_k=0$ for $k>2m$). It follows that $v_r=f(r+2)$ for $r\ge0$, and
$\delta_m=f(1)$. If $\delta_m=0$ and $a$ is given, we have a polynomial of degree $m-1$ in $b$, with
leading coefficient $\pm(u_{2m-1}-ma)=\pm(\gamma_2+\cdots+\gamma_m-m\gamma_m)$.

In Motzkin's unpublished notes he arranged to make $\delta_k=0$ almost always, by
choosing $\gamma$'s so that this leading coefficient is $\ne0$ when $m$ is even and $=0$ when $m$ is
odd; then we can almost always let $b$ be a (real) root of an odd-degree polynomial.
40. No; S. Winograd found a way to compute all polynomials of degree 13 with only
7 (possibly complex) multiplications [Comm. Pure and Applied Math. 25 (1972), 455-
457]. L. Revah found schemes that evaluate almost all polynomials of degree $n\ge9$
with $\lfloor n/2\rfloor+1$ (possibly complex) multiplications [SICOMP 4 (1975), 381-392]; she
also showed that when $n=9$ it is possible to achieve $\lfloor n/2\rfloor+1$ multiplications only with
at least $n+3$ additions. By appending sufficiently many additions (see exercise 39),
the "almost all" and "possibly complex" provisos disappear. V. Y. Pan [STOC 10
(1978), 162-172; IBM Research Report RC7754 (1979)] found schemes with $\lfloor n/2\rfloor+1$
(complex) multiplications and the minimum number $n+2+\delta_{n9}$ of (complex) additions,
for all odd $n\ge9$; his method for $n=9$ is

    $v(x)=((x+\alpha)^2+\beta)(x+\gamma)$,   $w(x)=v(x)+x$,
    $t_1(x)=(v(x)+\delta_1)(w(x)+\epsilon_1)$,   $t_2(x)=(v(x)+\delta_2)(w(x)+\epsilon_2)$,
    $u(x)=(t_1(x)+\zeta)(t_2(x)-t_1(x)+\eta)+\kappa$.

The minimum number of real additions necessary, when the minimum number of (real)
multiplications is achieved, remains unknown for $n\ge9$.

41. $a(c+d)-(a+b)d+i(a(c+d)+(b-a)c)$. [Beware of numerical instability. Three
multiplications are necessary, since complex multiplication is a special case of (71) with
$p(u)=u^2+1$. Without the restriction on additions there are other possibilities. For
example, the symmetric formula $ac-bd+i((a+b)(c+d)-ac-bd)$ was suggested by
Peter Ungar in 1963; Eq. 4.3.3-(2) is similar, with $2^n$ in the role of $i$. See I. Munro,
STOC 3 (1971), 40-44; S. Winograd, Linear Algebra and Its Applications 4 (1971),
381-388.]

Alternatively, if $a^2+b^2=1$ and $t=(1-a)/b=b/(1+a)$, the algorithm "$w=c-td$,
$v=d+bw$, $u=w-tv$" for calculating the product $(a+bi)(c+di)=u+iv$ has been
suggested by Oscar Buneman [J. Comp. Phys. 12 (1973), 127-128]. In this method if
$a=\cos\theta$ and $b=\sin\theta$, we have $t=\tan(\theta/2)$.
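
The three formulas are easy to compare side by side. The sketch below is illustrative only, and the function names are invented for this purpose; each function returns the pair (real part, imaginary part) of $(a+bi)(c+di)$.

```python
import math

def mul3(a, b, c, d):
    """Three real multiplications: a(c+d), (a+b)d, (b-a)c."""
    p = a * (c + d)
    return p - (a + b) * d, p + (b - a) * c

def mul3_ungar(a, b, c, d):
    """The symmetric formula: ac - bd + i((a+b)(c+d) - ac - bd)."""
    ac, bd = a * c, b * d
    return ac - bd, (a + b) * (c + d) - ac - bd

def rotate_buneman(theta, c, d):
    """(cos(theta) + i sin(theta))(c + di) using b = sin(theta), t = tan(theta/2)."""
    b, t = math.sin(theta), math.tan(theta / 2)
    w = c - t * d
    v = d + b * w
    u = w - t * v
    return u, v
```

With integer inputs the first two functions are exact; for example both give $(3+4i)(5+6i)=-9+38i$. The third is meant for repeated rotation by a fixed angle, where $b$ and $t$ would be precomputed.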

Helmut Alt and Jan van Leeuwen [Computing 27 (1981), 205-215] have shown
that four real multiplications or divisions are necessary for computing $1/(a+bi)$, and
four are sufficient for computing

$$\frac{a}{b+ci}=\frac{a}{b+c(c/b)}-i\,\frac{(c/b)a}{b+c(c/b)}.$$

Six multiplication-division operations and three addition-subtractions are necessary
and sufficient to compute $(a+bi)/(c+di)$ [T. Lickteig, SICOMP 16 (1987), 278-311].

In spite of these lower bounds, one should remember that complex arithmetic 
need not be implemented in terms of real arithmetic. For example, the time needed to 
multiply two n-place complex numbers is asymptotically only about twice the time to 
multiply two n-place real numbers, using fast Fourier transforms. 


42. (a) Let $\pi_1,\ldots,\pi_m$ be the $\lambda_i$'s that correspond to chain multiplications; then
$\pi_i=P_{2i-1}\times P_{2i}$ and $u(x)=P_{2m+1}$, where each $P_j$ has the form $\beta_j+\beta_{j0}x+\beta_{j1}\pi_1+
\cdots+\beta_{jr(j)}\pi_{r(j)}$, where $r(j)\le\lceil j/2\rceil-1$ and each of the $\beta_j$ and $\beta_{jk}$ is a polynomial in the
$\alpha$'s with integer coefficients. We can systematically modify the chain (see exercise 30)
so that $\beta_j=0$ and $\beta_{jr(j)}=1$, for $1\le j\le2m$; furthermore we can assume that $\beta_{30}=0$.
The result set now has at most $m+1+\sum_{j=1}^{2m}(\lceil j/2\rceil-1)=m^2+1$ degrees of freedom.

(b) Any such polynomial chain with at most $m$ chain multiplications can be
simulated by one with the form considered in (a), except that now we let $r(j)=\lceil j/2\rceil-1$
for $1\le j\le2m+1$, and we do not assume that $\beta_{30}=0$ or that $\beta_{jr(j)}=1$ for $j\ge3$.
This single canonical form involves $m^2+2m$ parameters. As the $\alpha$'s run through all
integers and as we run through all chains, the $\beta$'s run through at most $2^{m^2+2m}$ sets of
values mod 2, hence the result set does also. In order to obtain all $2^n$ polynomials
of degree $n$ with 0-1 coefficients, we need $m^2+2m\ge n$.

(c) Set $m=\lfloor\sqrt n\rfloor$ and compute $x^2$, $x^3$, ..., $x^m$. Let $u(x)=u_{m+1}(x)x^{(m+1)m}+
\cdots+u_1(x)x^m+u_0(x)$, where each $u_j(x)$ is a polynomial of degree $\le m$ with integer
coefficients (hence it can be evaluated without any more multiplications). Now evaluate
$u(x)$ by rule (2) as a polynomial in $x^m$ with known coefficients. (The number of
additions used is approximately the sum of the absolute values of the coefficients, so
this algorithm is efficient on 0-1 polynomials. Paterson and Stockmeyer also gave
another algorithm that uses about $\sqrt{2n}$ multiplications.)
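
The method of part (c) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' program: the function name is hypothetical, the blocks $u_j(x)$ are taken here to have degree less than $m$ (a slight variation that may cost one or two extra Horner steps), and a product of an integer coefficient with a known power is treated as repeated addition, so only products of two computed quantities are counted.

```python
import math

def eval_few_multiplications(u, x):
    """u[k] is the integer coefficient of x^k.
    Returns (u(x), number of chain multiplications used)."""
    n = len(u) - 1
    m = max(1, math.isqrt(n))
    mults = 0
    powers = [1, x]                        # powers[k] = x^k for 0 <= k <= m
    for _ in range(2, m + 1):
        powers.append(powers[-1] * x)
        mults += 1
    xm = powers[m]
    # u(x) = sum_j u_j(x) x^(m j), where each block u_j has degree < m.
    blocks = [u[i:i + m] for i in range(0, n + 1, m)]

    def block_value(block):
        # Integer multiples of known powers: counted as additions only.
        return sum(c * powers[k] for k, c in enumerate(block))

    result = block_value(blocks[-1])
    for block in reversed(blocks[:-1]):    # Horner's rule in the variable x^m
        result = result * xm + block_value(block)
        mults += 1
    return result, mults
```

For a polynomial of degree 100 this uses 9 multiplications for the powers and 10 for Horner's rule, 19 in all, in line with the bound of about $2\sqrt n$.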

References: SICOMP 2 (1973), 60-66; see also J. E. Savage, SICOMP 3 (1974),
150-158; J. Ganz, SICOMP 24 (1995), 473-483. For analogous results about additions,
see Borodin and Cook, SICOMP 5 (1976), 146-157; Rivest and Van de Wiele, Inf. Proc.
Letters 8 (1979), 178-180.


43. When a; = a; + a, is a step in some optimal addition chain for n + 1, compute 
x = ax" and pi = px + pj, where pi = x’ '|+---+2+1; omit the final calculation 
of «"*, We save one multiplication whenever az = 1, in particular when i = 1. (See 


exercise 4.6.3-31 with e = 5.) 


44. Let $l=\lfloor\lg n\rfloor$, and suppose $x$, $x^2$, $x^4$, ..., $x^{2^l}$ have been precomputed. If $u(x)$
is monic of degree $n=2m+1$, we can write $u(x)=(x^{m+1}+\alpha)v(x)+w(x)$, where
$v(x)$ and $w(x)$ are monic of degree $m$. This yields a method for $n=2^{l+1}-1\ge3$ that
requires $2^l-1$ further multiplications and $2^{l+1}+2^{l-1}-2$ additions. If $n=2^l$ we can
apply Horner's rule to reduce $n$ by 1. And if $m=2^l<n<2^{l+1}-1$, we can write
$u(x)=x^mv(x)+w(x)$ where $v$ and $w$ are monic of degrees $n-m$ and $m$, respectively;
by induction on $l$, this requires at most $\frac12n+l-1$ multiplications and $\frac54n$ additions,
after the precomputation. [See S. Winograd, IBM Tech. Disclosure Bull. 13 (1970),
1133-1135.]

Note: It is also possible to evaluate $u(x)$ with $\frac12n+O(\sqrt n)$ multiplications and
$n+O(\sqrt n)$ additions, under the same ground rules, if our goal is to minimize
multiplications + additions. The generic polynomial


en ((. (Ge? + ao) (2t! + 61) + a1) (2t? + b2) 
+02) +++) (a + Ba) + ary (0? + Bo) 


“covers” the coefficients of exponents {j, j + k,j +k + (k-—1),... j +k+(k-—1)+ 
-+ (j +1), m —k,m'—k+1,...,m' — j}, where 


m=mtj+ G+) (Pt) (2), 


By adding together such polynomials Pıkmı (x), Dokm2(X), ---, Pkkm, (£) for mj = 
oe) + aru? we obtain an arbitrary monic polynomial of degree k? +k+1. [Rabin 
and Winograd, Comm. on Pure and Applied Math. 25 (1972), 433-458, §2; this paper
also proves that constructions with $\frac12n+O(\log n)$ multiplications and $\le(1+\epsilon)n$
additions are possible for all $\epsilon>0$, if $n$ is large enough.]


45. It suffices to show that $(T_{ijk})$'s rank is at most that of $(t_{ijk})$, since we can obtain
$(t_{ijk})$ back from $(T_{ijk})$ by transforming it in the same way with $F^{-1}$, $G^{-1}$, $H^{-1}$. If
$t_{ijk}=\sum_{l=1}^{r}a_{il}b_{jl}c_{kl}$ then it follows immediately that

$$T_{ijk}=\sum_{l=1}^{r}\Bigl(\sum_{i'}F_{ii'}a_{i'l}\Bigr)\Bigl(\sum_{j'}G_{jj'}b_{j'l}\Bigr)\Bigl(\sum_{k'}H_{kk'}c_{k'l}\Bigr).$$


[H. F. de Groote has proved that all normal schemes that yield 2 x 2 matrix 
products with seven chain multiplications are equivalent, in the sense that they can be 
obtained from each other by nonsingular matrix multiplication as in this exercise. In 
this sense Strassen’s algorithm is unique. See Theor. Comp. Sci. 7 (1978), 127—148.] 


46. By exercise 45 we can add any multiple of a row, column, or plane to another one 
without changing the rank; we can also multiply a row, column, or plane by a nonzero 
constant, or transpose the tensor. A sequence of such operations can always be found to 
. 00) /00) /10\/00) /10)\/00) /10)/00 
reduce a given 2 x2 x 2 tensor to one of the forms (8 o lo a lo o) ( HP (5 D) (9 as G 0) G 1) 


G °) (° t). The last tensor has rank 3 or 2 according as the polynomial uf — ru — q has 


one or two irreducible factors in the field of interest, by Theorem W (see (74)). 

47. A general $m\times n\times s$ tensor has $mns$ degrees of freedom. By exercise 28 it is
impossible to express all $m\times n\times s$ tensors in terms of the $(m+n+s)r$ elements of
a realization $(A,B,C)$ unless $(m+n+s)r\ge mns$. On the other hand, assume that
$m\ge n\ge s$. The rank of an $m\times n$ matrix is at most $n$, so we can realize any tensor in
$ns$ chain multiplications by realizing each matrix plane separately. [Exercise 46 shows
that this lower bound on the maximum tensor rank is not best possible, nor is the
upper bound. Thomas D. Howell (Ph.D. thesis, Cornell Univ., 1976) has shown that
there are tensors of rank $\ge\lceil mns/(m+n+s-2)\rceil$ over the complex numbers.]


48. If $(A,B,C)$ and $(A',B',C')$ are realizations of $(t_{ijk})$ and $(t'_{ijk})$ of respective lengths
$r$ and $r'$, then $A''=A\oplus A'$, $B''=B\oplus B'$, $C''=C\oplus C'$, and $A'''=A\otimes A'$, $B'''=B\otimes B'$,
$C'''=C\otimes C'$, are realizations of $(t''_{ijk})$ and $(t'''_{ijk})$ of respective lengths $r+r'$ and $r\cdot r'$.

Note: Many people have made the natural conjecture that rank$((t_{ijk})\oplus(t'_{ijk}))=
$ rank$(t_{ijk})$ + rank$(t'_{ijk})$, but the constructions in exercise 60(b) and exercise 65 make
this seem much less plausible than it once was.


49. By Lemma T, rank$(t_{ijk})\ge$ rank$(t_{i(jk)})$. Conversely if $M$ is a matrix of rank $r$
we can transform it by row and column operations, finding nonsingular matrices $F$
and $G$ such that $FMG$ has all entries 0 except for $r$ diagonal elements that are 1;
see Algorithm 4.6.2N. The tensor rank of $FMG$ is therefore $\le r$; and it is the same as
the tensor rank of $M$, by exercise 45.

50. Let $i=(i',i'')$ where $1\le i'\le m$ and $1\le i''\le n$; then $t_{(i',i'')jk}=\delta_{i''j}\delta_{i'k}$, and
it is clear that rank$(t_{i(jk)})=mn$ since $(t_{i(jk)})$ is a permutation matrix. By Lemma T,
rank$(t_{ijk})\ge mn$. Conversely, since $(t_{ijk})$ has only $mn$ nonzero entries, its rank is
clearly $\le mn$. (There is consequently no normal scheme requiring fewer than the $mn$
obvious multiplications. There is no such abnormal scheme either [Comm. Pure and
Appl. Math. 23 (1970), 165-179]. But some savings can be achieved if the same matrix
is used with $s>1$ different column vectors, since this is equivalent to $(m\times n)$ times
$(n\times s)$ matrix multiplication.)


51. (a) $s_1=y_0+y_1$, $s_2=y_0-y_1$; $m_1=\frac12(x_0+x_1)s_1$, $m_2=\frac12(x_0-x_1)s_2$; $w_0=
m_1+m_2$, $w_1=m_1-m_2$. (b) Here are some intermediate steps, using the methodology
in the text: $((x_0-x_2)+(x_1-x_2)u)((y_0-y_2)+(y_1-y_2)u)\bmod(u^2+u+1)=
((x_0-x_2)(y_0-y_2)-(x_1-x_2)(y_1-y_2))+((x_0-x_2)(y_0-y_2)-(x_1-x_0)(y_1-y_0))u$.
The first realization is

$$\begin{pmatrix}1&1&-1&0\\1&0&1&1\\1&-1&0&-1\end{pmatrix},\quad\begin{pmatrix}1&1&-1&0\\1&0&1&1\\1&-1&0&-1\end{pmatrix},\quad\begin{pmatrix}1&1&1&-2\\1&1&-2&1\\1&-2&1&1\end{pmatrix}\times\frac13.$$

The second realization is

$$\begin{pmatrix}1&1&1&-2\\1&1&-2&1\\1&-2&1&1\end{pmatrix}\times\frac13,\quad\begin{pmatrix}1&1&-1&0\\1&-1&0&-1\\1&0&1&1\end{pmatrix},\quad\begin{pmatrix}1&1&-1&0\\1&0&1&1\\1&-1&0&-1\end{pmatrix}.$$

The resulting algorithm computes $s_1=y_0+y_1$, $s_2=y_0-y_1$, $s_3=y_2-y_0$, $s_4=y_2-y_1$,
$s_5=s_1+y_2$; $m_1=\frac13(x_0+x_1+x_2)s_5$, $m_2=\frac13(x_0+x_1-2x_2)s_2$, $m_3=\frac13(x_0-2x_1+x_2)s_3$,
$m_4=\frac13(-2x_0+x_1+x_2)s_4$; $t_1=m_1+m_2$, $t_2=m_1-m_2$, $t_3=m_1+m_3$, $w_0=t_1-m_3$,
$w_1=t_3+m_4$, $w_2=t_2-m_4$.
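
Both algorithms can be tested against the definition $w_k=\sum x_iy_j$ over $i+j\equiv k$. The following Python sketch is an illustration (function names invented here); exact fractions are used so that the factors $\frac12$ and $\frac13$ introduce no rounding.

```python
from fractions import Fraction

def cyclic2(x, y):
    """Part (a): cyclic convolution of degree 2 with two chain multiplications."""
    s1, s2 = y[0] + y[1], y[0] - y[1]
    m1 = Fraction(1, 2) * (x[0] + x[1]) * s1
    m2 = Fraction(1, 2) * (x[0] - x[1]) * s2
    return [m1 + m2, m1 - m2]

def cyclic3(x, y):
    """Part (b): cyclic convolution of degree 3 with four chain multiplications."""
    x0, x1, x2 = x
    y0, y1, y2 = y
    s1, s2, s3, s4 = y0 + y1, y0 - y1, y2 - y0, y2 - y1
    s5 = s1 + y2
    third = Fraction(1, 3)
    m1 = third * (x0 + x1 + x2) * s5
    m2 = third * (x0 + x1 - 2 * x2) * s2
    m3 = third * (x0 - 2 * x1 + x2) * s3
    m4 = third * (-2 * x0 + x1 + x2) * s4
    t1, t2, t3 = m1 + m2, m1 - m2, m1 + m3
    return [t1 - m3, t3 + m4, t2 - m4]

def cyclic_direct(x, y):
    """Definition: w_k = sum of x_i y_j over i + j = k (modulo n)."""
    n = len(x)
    return [sum(x[i] * y[(k - i) % n] for i in range(n)) for k in range(n)]
```

When the $x$'s are fixed, the combinations $\frac13(x_0+x_1+x_2)$ and so on are precomputed parameters, so only the four products by $s_5$, $s_2$, $s_3$, $s_4$ count as chain multiplications.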


52. Let $k=(k',k'')$ when $k\bmod n'=k'$ and $k\bmod n''=k''$. Then we wish to compute
$w_{(k',k'')}=\sum x_{(i',i'')}y_{(j',j'')}$ summed for $i'+j'\equiv k'$ (modulo $n'$) and $i''+j''\equiv k''$
(modulo $n''$). This can be done by applying the $n'$ algorithm to the $2n'$ vectors $X_{i'}$
and $Y_{j'}$ of length $n''$, obtaining the $n'$ vectors $W_{k'}$. Each vector addition becomes $n''$
additions, each parameter multiplication becomes $n''$ parameter multiplications, and
each chain multiplication of vectors is replaced by a cyclic convolution of degree $n''$. [If
the subalgorithms use the minimum number of chain multiplications over the rationals,
this algorithm uses $2(n'-d(n'))(n''-d(n''))$ more than the minimum, where $d(n)$ is
the number of divisors of $n$, because of exercise 4.6.2-32 and Theorem W.]


53. (a) Let $n(k)=(p-1)p^{e-k-1}=\varphi(p^{e-k})$ for $0\le k<e$, and $n(k)=1$ for $k\ge e$.
Represent the numbers $\{1,\ldots,m\}$ in the form $a^ip^k$ (modulo $m$), where $0\le k\le e$ and
$0\le i<n(k)$, and $a$ is a fixed primitive element modulo $p^e$. For example, when $m=9$
we can let $a=2$; the values are $\{2^03^0, 2^13^0, 2^03^1, 2^23^0, 2^53^0, 2^13^1, 2^43^0, 2^33^0, 2^03^2\}$.
Then $f(a^ip^k)=\sum_{0\le l\le e}\sum_{0\le j<n(l)}\omega^{g(i,j,k,l)}F(a^jp^l)$ where $g(i,j,k,l)=a^{i+j}p^{k+l}$.

We shall compute $f_{ikl}=\sum_{0\le j<n(l)}\omega^{g(i,j,k,l)}F(a^jp^l)$ for $0\le i<n(k)$ and for each
$k$ and $l$. This is a cyclic convolution of degree $n(k+l)$ on the values $x_i=\omega^{a^ip^{k+l}}$ and
$y_s=\sum_{0\le j<n(l)}[s+j\equiv0$ (modulo $n(k+l))]\,F(a^jp^l)$, since $f_{ikl}$ is $\sum x_ry_s$ summed over
$r+s\equiv i$ (modulo $n(k+l)$). The Fourier transform is obtained by summing appropriate
$f_{ikl}$'s. [Note: When linear combinations of the $x_i$ are formed, for example as in (69),
the result will be purely real or purely imaginary, when the cyclic convolution algorithm
has been constructed by using rule (59) with $u^{n(k)}-1=(u^{n(k)/2}-1)(u^{n(k)/2}+1)$. The
reason is that reduction mod $(u^{n(k)/2}-1)$ produces a polynomial with real coefficients
$\omega^j+\omega^{-j}$ while reduction mod $(u^{n(k)/2}+1)$ produces a polynomial with imaginary
coefficients $\omega^j-\omega^{-j}$.]

When $p=2$ an analogous construction applies, using the representation $(-1)^ia^j2^k$
(modulo $m$), where $0\le k\le e$ and $0\le i\le\min(e-k,1)$ and $0\le j<2^{e-k-2}$. In this
case we use the construction of exercise 52 with $n'=2$ and $n''=2^{e-k-2}$; although
these numbers are not relatively prime, the construction does yield the desired direct
product of cyclic convolutions.


(b) Let $a'm'+a''m''=1$; and let $\omega'=\omega^{a''m''}$, $\omega''=\omega^{a'm'}$. Define $s'=s\bmod m'$,
$s''=s\bmod m''$, $t'=t\bmod m'$, $t''=t\bmod m''$, so that $\omega^{st}=(\omega')^{s't'}(\omega'')^{s''t''}$. It
follows that

$$f(s',s'')=\sum_{t'=0}^{m'-1}\ \sum_{t''=0}^{m''-1}(\omega')^{s't'}(\omega'')^{s''t''}F(t',t'');$$

in other words, the one-dimensional Fourier transform on $m$ elements is actually a
two-dimensional Fourier transform on $m'\times m''$ elements, in slight disguise.
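
This correspondence can be tried out numerically. The sketch below is illustrative and makes some assumptions of its own: the function names are invented, indices run from 0 to $m-1$, and the multipliers $a'$ and $a''$ are obtained as modular inverses, so that $a'm'+a''m''\equiv1$ (modulo $m$), which suffices because $\omega^m=1$.

```python
import cmath

def dft(F, w):
    """f(s) = sum_t w^(s t) F(t) for a given root of unity w."""
    m = len(F)
    return [sum(w ** (s * t) * F[t] for t in range(m)) for s in range(m)]

def dft_two_dimensional(F, m1, m2):
    """Transform of m = m1*m2 values, gcd(m1, m2) = 1, done as an m1 x m2 transform."""
    m = m1 * m2
    w = cmath.exp(2j * cmath.pi / m)
    w1 = w ** (pow(m2, -1, m1) * m2)      # omega'  = omega^(a'' m'')
    w2 = w ** (pow(m1, -1, m2) * m1)      # omega'' = omega^(a' m')
    grid = [[None] * m2 for _ in range(m1)]
    for t in range(m):
        grid[t % m1][t % m2] = F[t]       # F(t', t'')
    rows = [dft(row, w2) for row in grid]                     # transform on t''
    cols = [dft([rows[t1][s2] for t1 in range(m1)], w1)       # transform on t'
            for s2 in range(m2)]
    return [cols[s % m2][s % m1] for s in range(m)]           # f(s) = f(s', s'')
```

No multiplications by powers of $\omega$ are needed between the row transforms and the column transforms; that is what distinguishes this arrangement from the passes of exercise 14. (The three-argument `pow` with exponent $-1$ requires Python 3.8 or later.)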

We shall deal with "normal" algorithms consisting of (i) a number of sums $s_i$ of
the $F$'s and $s$'s; followed by (ii) a number of products $m_j$, each of which is obtained
by multiplying one of the $F$'s or $s$'s by a real or imaginary number $\alpha_j$; followed by
(iii) a number of further sums $t_k$, each of which is formed from $m$'s or $t$'s (not $F$'s or
$s$'s). The final values must be $m$'s or $t$'s. For example, the "normal" Fourier transform
scheme for $m=5$ constructed from (69) and the method of part (a) is as follows:
$s_1=F(1)+F(4)$, $s_2=F(3)+F(2)$, $s_3=s_1+s_2$, $s_4=s_1-s_2$, $s_5=F(1)-F(4)$,
$s_6=F(2)-F(3)$, $s_7=s_5-s_6$; $m_1=\frac14(\omega+\omega^2+\omega^4+\omega^3)s_3$, $m_2=\frac14(\omega-\omega^2+\omega^4-\omega^3)s_4$,
$m_3=\frac12(\omega+\omega^2-\omega^4-\omega^3)s_5$, $m_4=\frac12(-\omega+\omega^2+\omega^4-\omega^3)s_6$, $m_5=\frac12(\omega^3-\omega^2)s_7$,
$m_6=1\cdot F(5)$, $m_7=1\cdot s_3$; $t_0=m_1+m_6$, $t_1=t_0+m_2$, $t_2=m_3+m_5$, $t_3=t_0-m_2$,
$t_4=m_4-m_5$, $t_5=t_1+t_2$, $t_6=t_3+t_4$, $t_7=t_1-t_2$, $t_8=t_3-t_4$, $t_9=m_6+m_7$.
Note the multiplication by 1 shown in $m_6$ and $m_7$; this is required by our conventions,
and it is important to include such cases for use in recursive constructions (although
the multiplications need not really be done). Here $m_6=f_{001}$, $m_7=f_{010}$, $t_5=
f_{000}+f_{001}=f(2^0)$, $t_6=f_{100}+f_{101}=f(2^1)$, etc. We can improve the scheme by
introducing $s_8=s_3+F(5)$, replacing $m_1$ by $(\frac14(\omega+\omega^2+\omega^4+\omega^3)-1)s_3$ [this is $-\frac54s_3$],
replacing $m_6$ by $1\cdot s_8$, and deleting $m_7$ and $t_9$; this saves one of the trivial multiplications
by 1, and it will be advantageous when the scheme is used to build larger ones. In the
improved scheme, $f(5)=m_6$, $f(1)=t_5$, $f(2)=t_6$, $f(3)=t_8$, $f(4)=t_7$.
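
The improved scheme for $m=5$ is short enough to transcribe and test directly. The Python sketch below is an illustration only; the function name and the use of a dictionary indexed by $1,\ldots,5$ (with index 5 in the role of 0) are choices made here. It performs 17 complex additions and 6 multiplications, one of them trivial.

```python
import cmath

def dft5_normal(F):
    """F: dict with keys 1..5. Returns dict s -> f(s) = sum_t omega^(s t) F(t)."""
    w = cmath.exp(2j * cmath.pi / 5)
    # (i) sums of the inputs
    s1, s2 = F[1] + F[4], F[3] + F[2]
    s3, s4 = s1 + s2, s1 - s2
    s5, s6 = F[1] - F[4], F[2] - F[3]
    s7 = s5 - s6
    s8 = s3 + F[5]
    # (ii) products by real or imaginary constants
    m1 = ((w + w**2 + w**4 + w**3) / 4 - 1) * s3      # this is -(5/4) s3
    m2 = ((w - w**2 + w**4 - w**3) / 4) * s4
    m3 = ((w + w**2 - w**4 - w**3) / 2) * s5
    m4 = ((-w + w**2 + w**4 - w**3) / 2) * s6
    m5 = ((w**3 - w**2) / 2) * s7
    m6 = 1 * s8
    # (iii) sums of the products
    t0 = m1 + m6
    t1, t3 = t0 + m2, t0 - m2
    t2, t4 = m3 + m5, m4 - m5
    t5, t7 = t1 + t2, t1 - t2
    t6, t8 = t3 + t4, t3 - t4
    return {5: m6, 1: t5, 2: t6, 3: t8, 4: t7}
```

In practice the five constants would be precomputed; they are written out in terms of $\omega$ here only to mirror the formulas.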

Now suppose we have normal one-dimensional schemes for $m'$ and $m''$, using
respectively $(a',a'')$ complex additions, $(t',t'')$ trivial multiplications by $\pm1$ or $\pm i$, and
a total of $(c',c'')$ complex multiplications including the trivial ones. (The nontrivial
complex multiplications are all "simple" since they involve only two real multiplications
and no real additions.) We can construct a normal scheme for the two-dimensional
$m'\times m''$ case by applying the $m'$ scheme to vectors $F(t',*)$ of length $m''$. Each $s_i$
step becomes $m''$ additions; each $m_j$ becomes a Fourier transform on $m''$ elements,
but with all of the $\alpha$'s in this algorithm multiplied by $\alpha_j$; and each $t_k$ becomes $m''$
additions. Thus the new algorithm has $(a'm''+c'a'')$ complex additions, $t't''$ trivial
multiplications, and a total of $c'c''$ complex multiplications.

Using these techniques, Winograd has found normal one-dimensional schemes for 
the following small values of m with the following costs (a, t, c): 


    m = 2    ( 2, 2,  2)        m = 7    (36, 1,  9)
    m = 3    ( 6, 1,  3)        m = 8    (26, 6,  8)
    m = 4    ( 8, 4,  4)        m = 9    (46, 1, 12)
    m = 5    (17, 1,  6)        m = 16   (74, 8, 18)


By combining these schemes as described above, we obtain methods that use fewer 
arithmetic operations than the “fast Fourier transform” (FFT) discussed in exercise 14. 
For example, when m = 1008 = 7-9-16, the costs come to (17946, 8, 1944), so we can do 
a Fourier transform on 1008 complex numbers with 3872 real multiplications and 35892 
real additions. It is possible to improve on Winograd’s method for combining relatively 
prime moduli by using multidimensional convolutions, as shown by Nussbaumer and 
Quandalle in IBM J. Res. and Devel. 22 (1978), 134-144; their ingenious approach 
reduces the amount of computation needed for 1008-point complex Fourier transforms 
to 3084 real multiplications and 34668 real additions. By contrast, the FFT on 1024 
complex numbers involves 14344 real multiplications and 27652 real additions. If the 
two-passes-at-once improvement in the answer to exercise 14 is used, however, the FFT 
on 1024 complex numbers needs only 10936 real multiplications and 25948 additions, 
and it is not difficult to implement. Therefore the subtler methods are faster only on 
machines that take significantly longer to multiply than to add. 
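The cost bookkeeping for nested schemes can be sketched in a few lines. The following is only an illustration of the combination rule $(a'm'' + c'a'',\ t't'',\ c'c'')$ stated above; the function names are hypothetical, and the search over nesting orders is an assumption about how one might choose among them (the cost depends on which scheme is outermost).

```python
from itertools import permutations

# Costs (a, t, c) of the short schemes, copied from the table above.
COSTS = {2: (2, 2, 2), 3: (6, 1, 3), 4: (8, 4, 4), 5: (17, 1, 6),
         7: (36, 1, 9), 8: (26, 6, 8), 9: (46, 1, 12), 16: (74, 8, 18)}

def combine(outer, inner_size, inner):
    """Cost of applying the outer scheme to vectors of length inner_size."""
    a1, t1, c1 = outer
    a2, t2, c2 = inner
    return (a1 * inner_size + c1 * a2, t1 * t2, c1 * c2)

def nested_cost(factors):
    """Nest the schemes for the given factors, leftmost factor outermost first."""
    cost = COSTS[factors[0]]
    for m in factors[1:]:
        cost = combine(cost, m, COSTS[m])
    return cost

def best_order(factors):
    """Try every nesting order and keep the one with the fewest additions."""
    return min((nested_cost(p), p) for p in permutations(factors))

def real_operations(cost):
    """(real multiplications, real additions) implied by a cost triple."""
    a, t, c = cost
    return (2 * (c - t), 2 * a)
```

With the factors 16, 7, 9 taken in that order, this reproduces the triple quoted for $m = 1008$.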

[References: Proc. Nat. Acad. Sci. USA 73 (1976), 1005-1006; Math. Comp. 32 
(1978), 175-199; Advances in Math. 32 (1979), 83-117; IEEE Trans. ASSP-27 (1979), 
169-181.] 


54. $\max(2e_1\deg(p_1) - 1, \ldots, 2e_q\deg(p_q) - 1, q + 1)$.
55. $2n' - q'$, where $n'$ is the degree of the minimum polynomial of $P$ (the monic
polynomial $\mu$ of least degree such that $\mu(P)$ is the zero matrix) and $q'$ is the number
of distinct irreducible factors it has. (Reduce $P$ by similarity transformations.)
56. Let $t_{ijk} + t_{jik} = \tau_{ijk} + \tau_{jik}$, for all $i$, $j$, $k$. If $(A, B, C)$ is a realization of $(t_{ijk})$
of rank $r$, then $\sum_l c_{kl}\bigl(\sum_i a_{il}x_i\bigr)\bigl(\sum_j b_{jl}x_j\bigr) = \sum_{i,j} t_{ijk}x_ix_j = \sum_{i,j}\tau_{ijk}x_ix_j$, for all $k$.
Conversely, let the $l$th chain multiplication of a polynomial chain, for $1 \le l \le r$, be
the product $\bigl(\alpha_l + \sum_i a_{il}x_i\bigr)\bigl(\beta_l + \sum_j b_{jl}x_j\bigr)$, where $\alpha_l$ and $\beta_l$ denote possible constant
terms and/or nonlinear terms. All terms of degree 2 appearing at any step of the chain
can be expressed as a linear combination $\sum_{l=1}^{r} c_l\bigl(\sum_i a_{il}x_i\bigr)\bigl(\sum_j b_{jl}x_j\bigr)$; hence the chain
defines a tensor $(t_{ijk})$ of rank $\le r$ such that $t_{ijk} + t_{jik} = \tau_{ijk} + \tau_{jik}$. This establishes the
hint. Now $\mathrm{rank}(\tau_{ijk} + \tau_{jik}) = \mathrm{rank}(t_{ijk} + t_{jik}) \le \mathrm{rank}(t_{ijk}) + \mathrm{rank}(t_{jik}) = 2\,\mathrm{rank}(t_{ijk})$.
A bilinear form in $x_1, \ldots, x_m$, $y_1, \ldots, y_n$ is a quadratic form in $m + n$ variables,
where $\tau_{ijk} = t_{i(j-m)k}$ for $i \le m$ and $j > m$, otherwise $\tau_{ijk} = 0$. Now $\mathrm{rank}(\tau_{ijk}) +
\mathrm{rank}(\tau_{jik}) \ge \mathrm{rank}(t_{ijk})$, since we obtain a realization of $(t_{ijk})$ by suppressing the last
$n$ rows of $A$ and the first $m$ rows of $B$ in a realization $(A, B, C)$ of $(\tau_{ijk} + \tau_{jik})$.


57. Let $N$ be the smallest power of 2 that exceeds $2n$, and let $u_{n+1} = \cdots = u_{N-1} =
v_{n+1} = \cdots = v_{N-1} = 0$. If $U_s = \sum_{0\le t<N}\omega^{st}u_t$ and $V_s = \sum_{0\le t<N}\omega^{st}v_t$ for $0 \le s < N$,
where $\omega = e^{2\pi i/N}$, then $\sum_{0\le s<N}\omega^{-st}U_sV_s = N\sum u_{t_1}v_{t_2}$, where the latter sum is taken
over all $t_1$ and $t_2$ with $0 \le t_1, t_2 < N$, $t_1 + t_2 \equiv t$ (modulo $N$). The terms vanish unless
$t_1 \le n$ and $t_2 \le n$, so $t_1 + t_2 < N$; thus the sum is the coefficient of $z^t$ in the product
$u(z)v(z)$. If we use the method of exercise 14 to compute the Fourier transforms and
the inverse transforms, the number of complex operations is $O(N\log N) + O(N\log N) +
O(N) + O(N\log N)$; and $N \le 4n$. [See Section 4.3.3C and the paper by J. M. Pollard,
Math. Comp. 25 (1971), 365-374.]
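The transform-multiply-untransform pattern of this answer can be sketched as follows. This is a minimal illustration, assuming integer coefficients small enough that floating-point rounding recovers them exactly; the recursive radix-2 transform below is a generic one, not necessarily the arrangement of exercise 14, and the padded length is simply the smallest power of 2 that holds the product.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 transform of a list whose length is a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1          # omega^{st} forward, omega^{-st} backward
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def multiply(u, v):
    """Coefficients of u(z)v(z) for integer coefficient lists u and v."""
    length = len(u) + len(v) - 1
    size = 1
    while size < length:
        size *= 2
    U = fft(u + [0] * (size - len(u)))
    V = fft(v + [0] * (size - len(v)))
    prod = fft([x * y for x, y in zip(U, V)], invert=True)
    # The inverse sum yields N times the convolution, so divide by N = size.
    return [round((c / size).real) for c in prod[:length]]
```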

When multiplying integer polynomials, it is possible to use an integer number $\omega$
that is of order $2^t$ modulo a prime $p$, and to determine the results modulo sufficiently
many primes. Useful primes in this regard, together with their least primitive roots $r$
(from which we take $\omega = r^{(p-1)/2^t} \bmod p$ when $p \bmod 2^t = 1$), can be found as described
in Section 4.5.4. For $t = 9$, the ten largest cases $< 2^{35}$ are $p = 2^{35} - 512a + 1$,
where $(a, r) = (28,7)$, $(31,10)$, $(34,13)$, $(56,3)$, $(58,10)$, $(76,5)$, $(80,3)$, $(85,11)$, $(91,5)$,
$(101,3)$; the ten largest cases $< 2^{31}$ are $p = 2^{31} - 512a + 1$, where $(a, r) = (1,10)$,
$(11,3)$, $(19,11)$, $(20,3)$, $(29,3)$, $(35,3)$, $(55,19)$, $(65,6)$, $(95,3)$, $(121,10)$. For larger $t$,
all primes $p$ of the form $2^tq + 1$ where $q < 32$ is odd and $2^{24} < p < 2^{36}$ are given by
$(p - 1, r) = (11\cdot2^{21}, 3)$, $(25\cdot2^{20}, 3)$, $(27\cdot2^{20}, 5)$, $(25\cdot2^{22}, 3)$, $(27\cdot2^{22}, 7)$, $(5\cdot2^{25}, 3)$,
$(7\cdot2^{26}, 3)$, $(27\cdot2^{26}, 13)$, $(15\cdot2^{27}, 31)$, $(17\cdot2^{27}, 3)$, $(3\cdot2^{30}, 5)$, $(13\cdot2^{28}, 3)$, $(29\cdot2^{27}, 3)$,
$(23\cdot2^{29}, 5)$. Some of the latter primes can be used with $\omega = 2^e$ for appropriate small $e$.
For a discussion of such primes, see R. M. Robinson, Proc. Amer. Math. Soc. 9 (1958),
673-681; S. W. Golomb, Math. Comp. 30 (1976), 657-663. Additional all-integer
methods are cited in the answer to exercise 4.6-5.
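As a small sanity check on the recipe $\omega = r^{(p-1)/2^t} \bmod p$, one can confirm that the resulting $\omega$ really has order $2^t$. The sketch below assumes that $r$ is a primitive root of $p$, as the listed pairs assert; the function names are hypothetical.

```python
def root_of_unity(p, r, t):
    """Return w = r^((p-1)/2^t) mod p; requires 2^t to divide p - 1."""
    if (p - 1) % (1 << t) != 0:
        raise ValueError("2^t must divide p - 1")
    return pow(r, (p - 1) >> t, p)

def has_order_two_to_t(w, p, t):
    """True if w has multiplicative order exactly 2^t modulo p."""
    # The order divides 2^t; it is exactly 2^t unless w^(2^(t-1)) is already 1.
    return pow(w, 1 << t, p) == 1 and pow(w, 1 << (t - 1), p) != 1

# Three of the primes of the form 2^t q + 1 listed above, as (q, t, r).
EXAMPLES = [(5, 25, 3), (7, 26, 3), (15, 27, 31)]

def check_examples():
    results = []
    for q, t, r in EXAMPLES:
        p = q * (1 << t) + 1
        results.append(has_order_two_to_t(root_of_unity(p, r, t), p, t))
    return results
```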

However, the method of exercise 59 will almost always be preferable in practice. 


58. (a) In general if $(A, B, C)$ realizes $(t_{ijk})$, then $((x_1, \ldots, x_m)A, B, C)$ is a realization
of the $1 \times n \times s$ matrix whose entry in row $j$, column $k$ is $\sum_i x_it_{ijk}$. So there must be at
least as many nonzero elements in $(x_1, \ldots, x_m)A$ as the rank of this matrix. In the case
of the $m \times n \times (m + n - 1)$ tensor corresponding to polynomial multiplication of degree
$m - 1$ by degree $n - 1$, the corresponding matrix has rank $n$ whenever $(x_1, \ldots, x_m) \ne
(0, \ldots, 0)$. A similar statement holds with $A \leftrightarrow B$ and $m \leftrightarrow n$.

Notes: In particular, if we work over the field of 2 elements, this says that the
rows of $A$ modulo 2 form a “linear code” of $m$ vectors having distance at least $n$,
whenever $(A, B, C)$ is a realization consisting entirely of integers. This observation,
due to R. W. Brockett and D. Dobkin [Linear Algebra and Its Applications 19 (1978),
207-235, Theorem 14; see also Lempel and Winograd, IEEE Trans. IT-23 (1977), 503-508;
Lempel, Seroussi, and Winograd, Theoretical Comp. Sci. 22 (1983), 285-296], can
be used to obtain nontrivial lower bounds on the rank over the integers. For example,
M. R. Brown and D. Dobkin [IEEE Trans. C-29 (1980), 337-340] have used it to show
that realizations of $n \times n$ polynomial multiplication over the integers must have rank
$\ge \alpha n$ for all sufficiently large $n$, when $\alpha$ is any real number less than

$$\alpha_{\min} = 3.52762\ 68026\ 32407\ 48061\ 54754\ 08128\ 07512\ 70182+;$$

here $\alpha_{\min} = 1/H(\sin^2\theta, \cos^2\theta)$, where $H(p, q) = p\lg(1/p) + q\lg(1/q)$ is the binary
entropy function and $\theta \approx 1.34686$ is the root of $\sin^2(\theta - \pi/4) = H(\sin^2\theta, \cos^2\theta)$. An
all-integer realization of rank $O(n\log n)$, based on cyclotomic polynomials, has been
constructed by M. Kaminski [J. Algorithms 9 (1988), 137-147].
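The constant $\alpha_{\min}$ is easy to reproduce numerically from its definition. The following sketch uses plain bisection; the bracketing interval $[1.3, 1.4]$ is an assumption chosen by inspection of the stated value of $\theta$, and double precision recovers only the leading digits.

```python
from math import sin, cos, log2, pi

def entropy(p, q):
    """Binary entropy H(p, q) = p lg(1/p) + q lg(1/q)."""
    return p * log2(1 / p) + q * log2(1 / q)

def gap(theta):
    """sin^2(theta - pi/4) - H(sin^2 theta, cos^2 theta); zero at the root."""
    return sin(theta - pi / 4) ** 2 - entropy(sin(theta) ** 2, cos(theta) ** 2)

def alpha_min(lo=1.3, hi=1.4, steps=100):
    """Return (theta, 1/H) by bisection, assuming gap changes sign on [lo, hi]."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if gap(lo) * gap(mid) <= 0:
            hi = mid
        else:
            lo = mid
    theta = (lo + hi) / 2
    return theta, 1 / entropy(sin(theta) ** 2, cos(theta) ** 2)
```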


10000000 

10000111 11000100 

w (opogrror), (92000101) | 11100010 
00110011 >1o00100011]’}] 10011111 
00011001 00101000 

00010000 


The following economical ways to realize the multiplication of general polynomials 
of degrees 2, 3, and 4 have been presented by H. Cohen and A. K. Lenstra [see Math. 
Comp. 48 (1987), S1—-S2]: 


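$$\begin{pmatrix}1&0&0&1&1&0\\0&1&0&1&0&1\\0&0&1&0&1&1\end{pmatrix},\quad \text{same},\quad
\begin{pmatrix}1&0&0&0&0&0\\-1&-1&0&1&0&0\\-1&1&-1&0&1&0\\0&-1&-1&0&0&1\\0&0&1&0&0&0\end{pmatrix};$$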


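$$\begin{pmatrix}1&0&0&0&1&1&0&0&1\\0&1&0&0&1&0&0&1&1\\0&0&1&0&0&1&1&0&1\\0&0&0&1&0&0&1&1&1\end{pmatrix},\quad \text{same},\quad
\begin{pmatrix}1&0&0&0&0&0&0&0&0\\-1&-1&0&0&1&0&0&0&0\\-1&1&-1&0&0&1&0&0&0\\1&1&1&1&-1&-1&-1&-1&1\\0&-1&1&-1&0&0&0&1&0\\0&0&-1&-1&0&0&1&0&0\\0&0&0&1&0&0&0&0&0\end{pmatrix};$$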

10000000000000 

11010000000000 


10011010110000 11101000000000 
01010101101000 11100110000100 
00101100011000 |, same, 11110011100111 
00000010110101 11001011010010 
00000001101011 01000101001100 


00000000000111 
00000000000010 
In each case the $A$ and $B$ matrices are identical.
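A realization $(A, B, C)$ can be checked mechanically: form the $r$ products $(\sum_i a_{il}x_i)(\sum_j b_{jl}y_j)$ and combine them with the rows of $C$. The sketch below does this for the degree-2 case with six multiplications. The entries of $A$ are the ones displayed above; the signs in $C$ are an assumption, filled in as the unique choice that makes the identity hold for those nonzero positions.

```python
A = [[1, 0, 0, 1, 1, 0],
     [0, 1, 0, 1, 0, 1],
     [0, 0, 1, 0, 1, 1]]
B = A                                   # "same"
C = [[ 1,  0,  0, 0, 0, 0],
     [-1, -1,  0, 1, 0, 0],
     [-1,  1, -1, 0, 1, 0],
     [ 0, -1, -1, 0, 0, 1],
     [ 0,  0,  1, 0, 0, 0]]

def realize(x, y):
    """Multiply polynomials x and y (three coefficients each) via (A, B, C)."""
    r = len(A[0])
    products = [sum(A[i][l] * x[i] for i in range(3)) *
                sum(B[j][l] * y[j] for j in range(3)) for l in range(r)]
    return [sum(C[k][l] * products[l] for l in range(r)) for k in range(len(C))]

def schoolbook(x, y):
    """Reference product using all nine coefficient multiplications."""
    z = [0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            z[i + j] += xi * yj
    return z
```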


59. [IEEE Trans. ASSP-28 (1980), 205-215.] Note that cyclic convolution is polynomial
multiplication mod $u^n - 1$, and negacyclic convolution is polynomial multiplication
mod $u^n + 1$. Let us now change notation, replacing $n$ by $2^n$; we shall consider recursive
algorithms for cyclic and negacyclic convolution $(z_0, \ldots, z_{2^n-1})$ of $(x_0, \ldots, x_{2^n-1})$ with
$(y_0, \ldots, y_{2^n-1})$. The algorithms are presented in unoptimized form, for brevity and
ease in exposition; readers who implement them will notice that many things can be
streamlined. For example, the final value of $Z_{2m-1}(w)$ in step N5 will always be zero.


C1. [Test for simple case.] If $n = 1$, set

$$z_0 \leftarrow x_0y_0 + x_1y_1, \qquad z_1 \leftarrow (x_0 + x_1)(y_0 + y_1) - z_0,$$

and terminate. Otherwise set $m \leftarrow 2^{n-1}$.


C2. [Remainderize.] For $0 \le k < m$, set $(x_k, x_{m+k}) \leftarrow (x_k + x_{m+k}, x_k - x_{m+k})$
and $(y_k, y_{m+k}) \leftarrow (y_k + y_{m+k}, y_k - y_{m+k})$. (Now we have $x(u) \bmod (u^m - 1) =
x_0 + \cdots + x_{m-1}u^{m-1}$ and $x(u) \bmod (u^m + 1) = x_m + \cdots + x_{2m-1}u^{m-1}$; we
will compute $x(u)y(u) \bmod (u^m - 1)$ and $x(u)y(u) \bmod (u^m + 1)$, then we will
combine the results by (59).)


C3. [Recurse.] Set $(z_0, \ldots, z_{m-1})$ to the cyclic convolution of $(x_0, \ldots, x_{m-1})$ with
$(y_0, \ldots, y_{m-1})$. Also set $(z_m, \ldots, z_{2m-1})$ to the negacyclic convolution of
$(x_m, \ldots, x_{2m-1})$ with $(y_m, \ldots, y_{2m-1})$.


C4. [Unremainderize.] For $0 \le k < m$, set $(z_k, z_{m+k}) \leftarrow \frac12(z_k + z_{m+k}, z_k - z_{m+k})$.
Now $(z_0, \ldots, z_{2m-1})$ is the desired answer. ∎
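Steps C1 through C4 translate almost line for line into code. The sketch below is an illustration only: the negacyclic half is computed by a schoolbook double loop, which is an assumption standing in for Algorithm N, and integer inputs are assumed so that the halving in step C4 is exact.

```python
def negacyclic(x, y):
    """Product of x(u) and y(u) modulo u^len + 1, by the schoolbook method."""
    n = len(x)
    z = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                z[i + j] += x[i] * y[j]
            else:
                z[i + j - n] -= x[i] * y[j]     # u^n = -1
    return z

def cyclic(x, y):
    """Cyclic convolution of integer lists whose common length is 2^n, n >= 1."""
    size = len(x)
    if size == 2:                                            # C1
        z0 = x[0] * y[0] + x[1] * y[1]
        return [z0, (x[0] + x[1]) * (y[0] + y[1]) - z0]
    m = size // 2
    xs = [x[k] + x[m + k] for k in range(m)]                 # C2: mod u^m - 1
    xd = [x[k] - x[m + k] for k in range(m)]                 #     mod u^m + 1
    ys = [y[k] + y[m + k] for k in range(m)]
    yd = [y[k] - y[m + k] for k in range(m)]
    zc = cyclic(xs, ys)                                      # C3
    zn = negacyclic(xd, yd)
    return ([(zc[k] + zn[k]) // 2 for k in range(m)] +       # C4
            [(zc[k] - zn[k]) // 2 for k in range(m)])
```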


N1. [Test for simple case.] If $n = 1$, set $t \leftarrow x_0(y_0 + y_1)$, $z_0 \leftarrow t - (x_0 + x_1)y_1$,
$z_1 \leftarrow t + (x_1 - x_0)y_0$, and terminate. Otherwise set $m \leftarrow 2^{\lfloor n/2\rfloor}$ and $r \leftarrow 2^{\lceil n/2\rceil}$.
(The following steps use $2^{n+1}$ auxiliary variables $X_{ij}$ for $0 \le i < 2m$ and $0 \le
j < r$, to represent $2m$ polynomials $X_i(w) = X_{i0} + X_{i1}w + \cdots + X_{i(r-1)}w^{r-1}$;
similarly, there are $2^{n+1}$ auxiliary variables $Y_{ij}$.)


N2. [Initialize auxiliary polynomials.] Set $X_{ij} \leftarrow X_{(i+m)j} \leftarrow x_{mj+i}$, $Y_{ij} \leftarrow
Y_{(i+m)j} \leftarrow y_{mj+i}$, for $0 \le i < m$ and $0 \le j < r$. (At this point we have
$x(u) = X_0(u^m) + uX_1(u^m) + \cdots + u^{m-1}X_{m-1}(u^m)$, and a similar formula
holds for $y(u)$. Our strategy will be to multiply these polynomials modulo
$(u^{mr} + 1) = (u^{2^n} + 1)$, by operating modulo $(w^r + 1)$ on the polynomials
$X(w)$ and $Y(w)$, finding their cyclic convolution of length $2m$ and thereby
obtaining $x(u)y(u) \equiv Z_0(u^m) + uZ_1(u^m) + \cdots + u^{2m-1}Z_{2m-1}(u^m)$.)

N3. [Transform.] (Now we will essentially do a fast Fourier transform on the
polynomials $(X_0, \ldots, X_{m-1}, 0, \ldots, 0)$ and $(Y_0, \ldots, Y_{m-1}, 0, \ldots, 0)$, using $w^{r/m}$ as a
$(2m)$th root of unity. This is efficient, because multiplication by a power of $w$
is not really a multiplication at all.) For $j = \lfloor n/2\rfloor - 1$, ..., 1, 0 (in this order),
do the following for all $m$ binary numbers $s + t = (s_{\lfloor n/2\rfloor} \ldots s_{j+1}0 \ldots 0)_2 +
(0 \ldots 0\,t_{j-1} \ldots t_0)_2$: Replace $(X_{s+t}(w), X_{s+t+2^j}(w))$ by the pair of polynomials
$(X_{s+t}(w) + w^{(r/m)s'}X_{s+t+2^j}(w),\ X_{s+t}(w) - w^{(r/m)s'}X_{s+t+2^j}(w))$, where
$s' = 2^j(s_{j+1} \ldots s_{\lfloor n/2\rfloor})_2$. (We are evaluating 4.3.3-(39), with $K = 2m$ and
$\omega = w^{r/m}$; notice the bit-reversal in $s'$. The polynomial operation $X_i(w) \leftarrow
X_i(w) + w^kX_l(w)$ means, more precisely, that we set $X_{ij} \leftarrow X_{ij} + X_{l(j-k)}$ for
$k \le j < r$, and $X_{ij} \leftarrow X_{ij} - X_{l(j-k+r)}$ for $0 \le j < k$. A copy of $X_l(w)$ can be
made without wasting much space.) Do the same transformation on the $Y$'s.

N4. [Recurse.] For $0 \le i < 2m$, set $(Z_{i0}, \ldots, Z_{i(r-1)})$ to the negacyclic convolution
of $(X_{i0}, \ldots, X_{i(r-1)})$ and $(Y_{i0}, \ldots, Y_{i(r-1)})$.

N5. [Untransform.] For $j = 0$, 1, ..., $\lfloor n/2\rfloor$ (in this order), and for all $m$ choices
of $s$ and $t$ as in step N3, set $(Z_{s+t}(w), Z_{s+t+2^j}(w))$ to

$$\tfrac12\bigl(Z_{s+t}(w) + Z_{s+t+2^j}(w),\ w^{-(r/m)s'}(Z_{s+t}(w) - Z_{s+t+2^j}(w))\bigr).$$

N6. [Repack.] (Now we have accomplished the goal stated at the end of step N2,
since it is easy to show that the transform of the $Z$'s is the product of the
transforms of the $X$'s and the $Y$'s.) Set $z_i \leftarrow Z_{i0} - Z_{(m+i)(r-1)}$ and $z_{mj+i} \leftarrow
Z_{ij} + Z_{(m+i)(j-1)}$ for $0 < j < r$, for $0 \le i < m$. ∎

It is easy to verify that at most $n$ extra bits of precision are needed for the
intermediate variables in this calculation; for example, if $|x_i| \le M$ for $0 \le i < 2^n$
at the beginning of the algorithm, then all of the $x$ and $X$ variables will be bounded
by $2^nM$ throughout. All of the $z$ and $Z$ variables will be bounded by $(2^nM)^2$, which
is $n$ more bits than required to hold the final convolution.

Algorithm N performs $A_n$ addition-subtractions, $D_n$ halvings, and $M_n$ multiplications,
where $A_1 = 5$, $D_1 = 0$, $M_1 = 3$; for $n > 1$ we have $A_n = \lfloor n/2\rfloor 2^{n+2} +
2^{\lfloor n/2\rfloor+1}A_{\lceil n/2\rceil} + (\lfloor n/2\rfloor + 1)2^{n+1} + 2^n$, $D_n = 2^{\lfloor n/2\rfloor+1}D_{\lceil n/2\rceil} + (\lfloor n/2\rfloor + 1)2^{n+1}$, and
$M_n = 2^{\lfloor n/2\rfloor+1}M_{\lceil n/2\rceil}$. The solutions are $A_n = 11\cdot2^{n-1+\lceil\lg n\rceil} - 3\cdot2^n + 6\cdot2^nS_n$,
$D_n = 4\cdot2^{n-1+\lceil\lg n\rceil} - 2\cdot2^n + 2\cdot2^nS_n$, $M_n = 3\cdot2^{n-1+\lceil\lg n\rceil}$; here $S_n$ satisfies the
recurrence $S_1 = 0$, $S_n = 2S_{\lceil n/2\rceil} + \lfloor n/2\rfloor$, and it is not difficult to prove the inequalities
$\frac12n\lceil\lg n\rceil \le S_n \le S_{n+1} \le \frac12n\lg n + n$ for all $n \ge 1$. Algorithm C does approximately
the same amount of work as Algorithm N.
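The stated solutions can be confirmed against the recurrences directly. This is a small checking sketch; the function names are hypothetical, and `ceil_lg` computes $\lceil\lg n\rceil$ with integer arithmetic.

```python
from functools import lru_cache

def ceil_lg(n):
    """Ceiling of the base-2 logarithm of a positive integer."""
    return (n - 1).bit_length()

@lru_cache(maxsize=None)
def counts(n):
    """(A_n, D_n, M_n) computed from the recurrences for Algorithm N."""
    if n == 1:
        return (5, 0, 3)
    lo, hi = n // 2, (n + 1) // 2
    a, d, m = counts(hi)
    f = 2 ** (lo + 1)
    return (lo * 2 ** (n + 2) + f * a + (lo + 1) * 2 ** (n + 1) + 2 ** n,
            f * d + (lo + 1) * 2 ** (n + 1),
            f * m)

@lru_cache(maxsize=None)
def S(n):
    """S_1 = 0, S_n = 2 S_ceil(n/2) + floor(n/2)."""
    return 0 if n == 1 else 2 * S((n + 1) // 2) + n // 2

def closed_form(n):
    """(A_n, D_n, M_n) from the stated solutions."""
    p = 2 ** (n - 1 + ceil_lg(n))
    return (11 * p - 3 * 2 ** n + 6 * 2 ** n * S(n),
            4 * p - 2 * 2 ** n + 2 * 2 ** n * S(n),
            3 * p)
```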


60. (a) In $\Sigma_1$, for example, we can group all terms having a common value of $j$ and $k$
into a single trilinear term; this gives $\nu^2$ trilinear terms when $(j, k) \in E \times E$, plus $\nu^2$
when $(j, k) \in E \times O$ and $\nu^2$ when $(j, k) \in O \times E$. When $j = k$ we can also include
$-x_{jj}y_{jj}z_{jj}$ in $\Sigma_1$, free of charge. [In the case $n = 10$, the method multiplies $10 \times 10$
matrices with 710 noncommutative multiplications; this is almost as good as seven
$5 \times 5$ multiplications by the method of Makarov cited in the answer to exercise 12,
although Winograd's scheme (35) uses only 600 when commutativity is allowed. With
a similar scheme, Pan showed for the first time that $M(n) < n^{2.8}$ for all large $n$, and
this awakened great interest in the problem. See SICOMP 9 (1980), 321-342.]

(b) Here we simply let $S$ be all the indices $(i, j, k)$ of one problem, $\tilde S$ the indices
$[k, i, j]$ of the other, and work with an $(mn + sm) \times (ns + mn) \times (sm + ns)$ tensor. [When
$m = n = s = 10$, the result is quite surprising: We can multiply two separate $10 \times 10$
matrices with 1300 noncommutative multiplications, while no scheme is known that
would multiply each of them with 650.]
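61. (a) Replace $a_{il}(u)$ by $u\,a_{il}(u)$. (b) Let $a_{il}(u) = \sum_\mu a_{il\mu}u^\mu$, etc., in a polynomial
realization of length $r = \mathrm{rank}_d(t_{ijk})$. Then $t_{ijk} = \sum_{\mu+\nu+\sigma=d}\sum_{l=1}^{r} a_{il\mu}b_{jl\nu}c_{kl\sigma}$. [This
result can be improved to $\mathrm{rank}(t_{ijk}) \le (2d + 1)\,\mathrm{rank}_d(t_{ijk})$ in an infinite field, because
the trilinear form $\sum_{\mu+\nu+\sigma=d} a_\mu b_\nu c_\sigma$ corresponds to multiplication of polynomials
modulo $u^{d+1}$, as pointed out by Bini and Pan. See Calcolo 17 (1980), 87-97.] (c,d) This is
clear from the realizations in exercise 48.

(e) Suppose we have realizations of $t$ and $rt'$ such that $\sum_{l=1}^{r} a_{il}b_{jl}c_{kl} = t_{ijk}u^d +
O(u^{d+1})$ and $\sum_{L=1}^{R} A_{(ii')L}B_{(jj')L}C_{(kk')L} = [i = j = k]\,t'_{i'j'k'}u^{d'} + O(u^{d'+1})$. Then

$$\sum_{L=1}^{R}\Bigl(\sum_{l=1}^{r} a_{il}A_{(li')L}\Bigr)\Bigl(\sum_{m=1}^{r} b_{jm}B_{(mj')L}\Bigr)\Bigl(\sum_{n=1}^{r} c_{kn}C_{(nk')L}\Bigr) = t_{ijk}\,t'_{i'j'k'}\,u^{d+d'} + O(u^{d+d'+1}).$$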
62. The rank is 3, by the method of proof in Theorem W with P = (3 ae The border
rank cannot be 1, since we cannot have $a_1(u)b_1(u)c_1(u) \equiv a_1(u)b_2(u)c_2(u) \equiv u^d$ and
$a_1(u)b_2(u)c_1(u) \equiv a_1(u)b_1(u)c_2(u) \equiv 0$ (modulo $u^{d+1}$). The border rank is 2 because
of the realization G els (¢ P ; k z : 

The notion of border rank was introduced by Bini, Capovani, Lotti, and Romani 
in Information Processing Letters 8 (1979), 234-235. 


63. (a) Let the elements of $T(m, n, s)$ and $T(M, N, S)$ be denoted by $t_{(i,j')(j,k')(k,i')}$
and $T_{(I,J')(J,K')(K,I')}$, respectively. Each element $\mathcal{T}_{(\mathcal{I},\mathcal{J}')(\mathcal{J},\mathcal{K}')(\mathcal{K},\mathcal{I}')}$ of the direct
product, where $\mathcal{I} = (i, I)$, $\mathcal{J} = (j, J)$, and $\mathcal{K} = (k, K)$, is equal to $t_{(i,j')(j,k')(k,i')}
T_{(I,J')(J,K')(K,I')}$ by definition, so it is $[\mathcal{I}' = \mathcal{I}$ and $\mathcal{J}' = \mathcal{J}$ and $\mathcal{K}' = \mathcal{K}]$.

(b) Apply exercise 61(e) with $M(N) = \mathrm{rank}_0(T(N, N, N))$.

(c) We have $M(mns) \le r^3$, since $T(mns, mns, mns) = T(m, n, s) \otimes T(n, s, m) \otimes
T(s, m, n)$. If $M(n) \le R$ we have $M(n^h) \le R^h$ for all $h$, and it follows that $M(N) \le
M(n^{\lceil\log_n N\rceil}) \le R^{\lceil\log_n N\rceil} \le R\,N^{\log R/\log n}$. [This result appears in Pan's paper of 1972.]

(d) We have $M_d(mns) \le r^3$ for some $d$, where $M_d(n) = \mathrm{rank}_d(T(n, n, n))$. If
$M_d(n) \le R$ we have $M_{hd}(n^h) \le R^h$ for all $h$, and the stated formula follows since
$M(n^h) \le \binom{hd+2}{2}R^h$ by exercise 61(b). In an infinite field we save a factor of $\log N$.
[This result is due to Bini and Schönhage, 1979.]


64. We have $\sum_k\bigl(f_k(u) + \sum_{j\ne k} g_{jk}(u)\bigr) = u^2\sum_{1\le i,j,k\le 3} x_{ij}y_{jk}z_{ki} + O(u^3)$, when
$f_k(u) = (x_{k1} + u^2x_{k2})(y_{2k} + u^2y_{1k})z_{kk} + (x_{k1} + u^2x_{k3})\,y_{3k}\bigl((1 + u)z_{kk} - u(z_{1k} + z_{2k} + z_{3k})\bigr)
- x_{k1}(y_{2k} + y_{3k})(z_{k1} + z_{k2} + z_{k3})$ and $g_{jk}(u) = (x_{k1} + u^2x_{j3})(y_{3k} + uy_{1j})(z_{kj} + uz_{jk}) +
(x_{k1} + u^2x_{j2})(y_{2k} - uy_{1j})z_{kj}$. [The best upper bound known for $\mathrm{rank}(T(3, 3, 3))$ is 23;
see the answer to exercise 12. The border rank of $T(2, 2, 2)$ remains unknown.]
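An identity of this kind is easy to test by machine: expand the 21 products as polynomials in $u$, discard everything of degree 3 or more, and compare the result with $u^2$ times the trace form. The sketch below does so with exact integer arithmetic on random matrices. It assumes the reading $f_k(u) = (x_{k1} + u^2x_{k2})(y_{2k} + u^2y_{1k})z_{kk} + \cdots$ with a minus sign before the last term of $f_k$, which is the form under which the terms of degree 0 and 1 cancel; the helper names are hypothetical.

```python
import random

def pmul(*polys):
    """Product of polynomials in u (coefficient lists), truncated modulo u^3."""
    out = [1, 0, 0]
    for p in polys:
        nxt = [0, 0, 0]
        for i, a in enumerate(out):
            for j, b in enumerate(p):
                if i + j < 3:
                    nxt[i + j] += a * b
        out = nxt
    return out

def padd(*polys):
    return [sum(c) for c in zip(*polys)]

def approximate_sum(x, y, z):
    """Sum of the 21 trilinear products modulo u^3; indices run from 1 to 3."""
    R = (1, 2, 3)
    def f(k):
        col = sum(z[j][k] for j in R)
        row = sum(z[k][j] for j in R)
        return padd(
            pmul([x[k][1], 0, x[k][2]], [y[2][k], 0, y[1][k]], [z[k][k]]),
            pmul([x[k][1], 0, x[k][3]], [y[3][k]], [z[k][k], z[k][k] - col]),
            pmul([-x[k][1]], [y[2][k] + y[3][k]], [row]))
    def g(j, k):
        return padd(
            pmul([x[k][1], 0, x[j][3]], [y[3][k], y[1][j]], [z[k][j], z[j][k]]),
            pmul([x[k][1], 0, x[j][2]], [y[2][k], -y[1][j]], [z[k][j]]))
    return padd(*([f(k) for k in R] + [g(j, k) for k in R for j in R if j != k]))

def trace_form(x, y, z):
    R = (1, 2, 3)
    return sum(x[i][j] * y[j][k] * z[k][i] for i in R for j in R for k in R)

def random_matrix():
    """A 3 x 3 integer matrix stored with a dummy row and column 0."""
    return [[0] * 4] + [[0] + [random.randint(-9, 9) for _ in range(3)]
                        for _ in range(3)]
```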
65. The polynomial in the hint is $u^2\sum_{i=1}^{m}\sum_{j=1}^{n}(x_iy_jz_{ij} + X_{ij}Y_{ij}Z) + O(u^3)$. Let $X_{ij}$
and $Y_{ij}$ be indeterminates for $1 \le i < m$ and $1 \le j < n$; also set $X_{in} = Y_{mj} = 0$, $X_{mj} =
-\sum_{i=1}^{m-1}X_{ij}$, $Y_{in} = -\sum_{j=1}^{n-1}Y_{ij}$. Thus with $mn + 1$ multiplications of polynomials in
the indeterminates we can compute $x_iy_j$ for each $i$ and $j$ and also $\sum_{i=1}^{m}\sum_{j=1}^{n}X_{ij}Y_{ij} =
\sum_{i=1}^{m-1}\sum_{j=1}^{n-1}X_{ij}Y_{ij}$. [SICOMP 10 (1981), 434-455. In this classic paper Schönhage
also derived, among other things, the results of exercises 64, 66, and 67(i).]


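66. (a) Let $\omega = \liminf_{n\to\infty}\log M(n)/\log n$; we have $\omega \ge 2$ by Lemma T. For all $\epsilon > 0$,
there is an $N$ with $M(N) < N^{\omega+\epsilon}$. The argument of exercise 63(c) now shows that
$\log M(n)/\log n < \omega + 2\epsilon$ for all sufficiently large $n$.

(b) This is an immediate consequence of exercise 63(d).

(c) Let $r = \mathrm{rank}(t)$, $q = (mns)^{\omega/3}$, $Q = (MNS)^{\omega/3}$. Given $\epsilon > 0$, there is an
integer constant $c_\epsilon$ such that $M(p) \le c_\epsilon p^{\omega+\epsilon}$ for all positive integers $p$. For every
integer $h > 0$ we have $t^h = \bigoplus_k \binom{h}{k}T(m^kM^{h-k}, n^kN^{h-k}, s^kS^{h-k})$, and $\mathrm{rank}(t^h) \le r^h$.
Given $h$ and $k$, let $p = \bigl\lfloor\binom{h}{k}^{1/(\omega+\epsilon)}\bigr\rfloor$. Then

$$\begin{aligned}\mathrm{rank}\bigl(T(pm^kM^{h-k}, pn^kN^{h-k}, ps^kS^{h-k})\bigr) &\le \mathrm{rank}\bigl(M(p)\,T(m^kM^{h-k}, n^kN^{h-k}, s^kS^{h-k})\bigr)\\ &\le \mathrm{rank}\bigl(c_\epsilon\tbinom{h}{k}T(m^kM^{h-k}, n^kN^{h-k}, s^kS^{h-k})\bigr)\end{aligned}$$

by exercise 63(b), and it follows from part (b) that

$$p^\omega q^kQ^{h-k} = (pm^kM^{h-k}\,pn^kN^{h-k}\,ps^kS^{h-k})^{\omega/3} \le c_\epsilon r^h.$$

Since $p \ge \binom{h}{k}^{1/(\omega+\epsilon)}/2$ we have

$$\binom{h}{k}q^kQ^{h-k} \le \binom{h}{k}^{\epsilon/(\omega+\epsilon)}(2p)^\omega q^kQ^{h-k} \le 2^{\epsilon h/(\omega+\epsilon)}2^\omega c_\epsilon r^h.$$

Therefore $(q + Q)^h \le (h + 1)2^{\epsilon h/(\omega+\epsilon)}2^\omega c_\epsilon r^h$ for all $h$. And it follows that we must
have $q + Q \le 2^{\epsilon/(\omega+\epsilon)}r$ for all $\epsilon > 0$.

(d) Set $m = n = 4$ in exercise 65, and note that $16^{0.85} + 9^{0.85} > 17$.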


67. (a) The $mn \times mns^2$ matrix $(t_{(i,j')((j,k')(k,i'))})$ has rank $mn$ because it is a
permutation matrix when restricted to the $mn$ rows for which $k = k' = 1$.

(b) $((t \oplus t')_{i(jk)})$ is essentially $(t_{i(jk)}) \oplus (t'_{i(jk)})$, plus $n's + sn'$ additional columns
of zeros. [Similarly we have $((t \otimes t')_{i(jk)}) = (t_{i(jk)}) \otimes (t'_{i(jk)})$ for the direct product.]
(c) Let $D$ be the diagonal matrix $\mathrm{diag}(d_1, \ldots, d_r)$, so that $ADB^T = O$. We
know by Lemma T that $\mathrm{rank}(A) = m$ and $\mathrm{rank}(B) = n$; hence $\mathrm{rank}(AD) = m$ and
$\mathrm{rank}(DB^T) = n$. We can assume without loss of generality that the first $m$ columns
of $A$ are linearly independent. Since the columns of $B^T$ are in the null space of $AD$,
we may also assume that the last $n$ columns of $B$ are linearly independent. Write $A$
in the partitioned form $(A_1\ A_2\ A_3)$ where $A_1$ is $m \times m$ (and nonsingular), $A_2$ is $m \times q$,
and $A_3$ is $m \times n$. Also partition $D$ so that $AD = (A_1D_1\ A_2D_2\ A_3D_3)$. Then there is
a $q \times r$ matrix $W = (W_1\ I\ O)$ such that $ADW^T = O$, namely $W_1 = -D_2A_2^TA_1^{-T}D_1^{-1}$.
Similarly, we may write $B = (B_1\ B_2\ B_3)$, and we find $VDB^T = O$ when $V = (O\ I\ V_3)$
is the $q \times r$ matrix with $V_3 = -D_2B_2^TB_3^{-T}D_3^{-1}$. Notice that $WDV^T = D_2$, so the hint
is established (more or less; after all, it was just a hint).

Now we let $A_{il}(u) = a_{il}$ for $1 \le i \le m$, $A_{(m+i)l}(u) = uv_{il}/d_{m+i}$; $B_{jl}(u) = b_{jl}$ for
$1 \le j \le n$, $B_{(n+j)l}(u) = w_{jl}u$; $C_{kl}(u) = u^2c_{kl}$ for $1 \le k \le s$, $C_{(s+1)l}(u) = d_l$. It follows
that $\sum_l A_{il}(u)B_{jl}(u)C_{kl}(u) = u^2t_{ijk} + O(u^3)$ if $k \le s$, $u^2[i > m][j > n]$ if $k = s + 1$.
[In this proof we did not need to assume that $t$ is nondegenerate with respect to $C$.]

(d) Consider the meee realization of T(m, 1,n) with r = mn-+1: ai = [[l/n| = 
i— 1], bj = [[modn= J], buzy = [l= (i—1)n + j], if l < mn; air = 1, bjr = —1, 
C(ij)r = 0. This is improvable with dı = 1 for 1 <I <r. 

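(e) The idea is to find an improvable realization of $T(m, n, s)$. Suppose $(A, B, C)$
is a realization of length $r$. Given arbitrary integers $\alpha_1, \ldots, \alpha_m$, $\beta_1, \ldots, \beta_s$, extend
$A$, $B$, and $C$ by defining

$$A_{(ij')(r+p)} = \alpha_i[j' = p],\quad B_{(jk')(r+p)} = \beta_{k'}[j = p],\quad C_{(ki')(r+p)} = 0,\quad \text{for } 1 \le p \le n.$$

If $d_l = \sum_{i'=1}^{m}\sum_{k=1}^{s}\alpha_{i'}\beta_kC_{(ki')l}$ for $l \le r$ and $d_l = -1$ otherwise, we have

$$\begin{aligned}\sum_{l=1}^{r+n}A_{(ij')l}B_{(jk')l}d_l &= \sum_{i'=1}^{m}\sum_{k=1}^{s}\alpha_{i'}\beta_k\sum_{l=1}^{r}A_{(ij')l}B_{(jk')l}C_{(ki')l} - \sum_{p=1}^{n}\alpha_i[j' = p]\,\beta_{k'}[j = p]\\ &= [j = j']\,\alpha_i\beta_{k'} - [j = j']\,\alpha_i\beta_{k'} = 0;\end{aligned}$$

so this is improvable if $d_1 \ldots d_r \ne 0$. But $d_1 \ldots d_r$ is a polynomial in $(\alpha_1, \ldots, \alpha_m,
\beta_1, \ldots, \beta_s)$, not identically zero, since we can assume without loss of generality that $C$
has no all-zero columns. Therefore some choice of $\alpha$'s and $\beta$'s will work.

(f) If $M(n) = n^\omega$ we have $M(n^h) = n^{h\omega}$, hence

$$\mathrm{rank}\bigl(T(n^h, n^h, n^h) \oplus T(1, n^{h\omega} - n^h(2n^h - 1), 1)\bigr) \le n^{h\omega} + n^h.$$

Exercise 66(c) now implies that $n^{h\omega} + (n^{h\omega} - 2n^{2h} + n^h)^{\omega/3} \le n^{h\omega} + n^h$ for all $h$.
Therefore $\omega = 2$; but this contradicts the lower bound $2n^2 - 1$ (see the answer to
exercise 12).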




(g) Let f(u) and g(u) be polynomials such that the elements of V f(u) and Wg(w) 
are polynomials. Then we redefine 


Aqui = taf) [deem Bas =U glu) /p, Car ena, 
where f(u)g(u) = pu® + O(u°*"). It follows that )7/_, Ai(u) Byi(u) Cxi(u) is equal to 
UTT? tije + Olut’) if k < s, ult??? [i> m][j >n] if k = s+1. [Note: The result 
of (e) therefore holds over any field, if rankz is replaced by rank, since we can choose 
the a’s and §’s to be polynomials of the form 1 + O(u).] 

(h) Let row $p$ of $C$ refer to the component $T(1, 16, 1)$. The key point is that
$\sum_l a_{il}(u)b_{jl}(u)c_{pl}(u)$ is zero (not simply $O(u^{d+1})$) for all $i$ and $j$ that remain after
deletion; moreover, $c_{pl}(u) \ne 0$ for all $l$. These properties are true in the constructions
of parts (c) and (g), and they remain true when we take direct products.

(i) The proof generalizes from binomials to multinomials in a straightforward way. 

(j) After part (h) we have $81^{\omega/3} + 2(36^{\omega/3}) + 34^{\omega/3} \le 100$, so $\omega < 2.52$. Squaring
once again gives $\mathrm{rank}(T(81, 1, 81) \oplus 4T(27, 4, 27) \oplus 2T(9, 34, 9) \oplus 4T(9, 16, 9) \oplus
4T(3, 136, 3) \oplus T(1, 3334, 1)) \le 10000$; this yields $\omega < 2.4999$. Success! Continued
squaring leads to better and better bounds that converge rapidly to 2.497723729083....
If we had started with $T(4, 1, 4) \oplus T(1, 9, 1)$ instead of $T(3, 1, 3) \oplus T(1, 4, 1)$, the limiting
bound would have been 2.51096309....
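Each bound on $\omega$ quoted here comes from an inequality of the form $\sum_i (m_in_is_i)^{\omega/3} \le r$, as in exercise 66(c). The following sketch solves the corresponding equation for $\beta = \omega/3$ by bisection and reports $3\beta$; the function name and the way the direct sum is encoded as (multiplicity, shape) pairs are assumptions made for illustration.

```python
def omega_bound(terms, rank, steps=200):
    """Upper bound 3*beta, where sum(count * (m*n*s)**beta) = rank.

    terms is a list of (count, (m, n, s)) pairs describing a direct sum of
    matrix-multiplication tensors whose (border) rank is at most `rank`.
    """
    def total(beta):
        return sum(c * (m * n * s) ** beta for c, (m, n, s) in terms)
    lo, hi = 0.0, 1.0                 # total(0) < rank <= total(1) is assumed
    for _ in range(steps):
        mid = (lo + hi) / 2
        if total(mid) < rank:
            lo = mid
        else:
            hi = mid
    return 3 * hi

# Exercise 66(d): T(4,1,4) + T(1,9,1) with border rank at most 17.
FIRST = ([(1, (4, 1, 4)), (1, (1, 9, 1))], 17)
# Part (j), after part (h): T(9,1,9) + 2T(3,4,3) + T(1,34,1), rank <= 100.
SECOND = ([(1, (9, 1, 9)), (2, (3, 4, 3)), (1, (1, 34, 1))], 100)
# Part (j), after squaring once again, rank <= 10000.
THIRD = ([(1, (81, 1, 81)), (4, (27, 4, 27)), (2, (9, 34, 9)),
          (4, (9, 16, 9)), (4, (3, 136, 3)), (1, (1, 3334, 1))], 10000)
```

Calling `omega_bound(*FIRST)`, `omega_bound(*SECOND)`, and `omega_bound(*THIRD)` gives three successively smaller estimates, in line with the progression described in the text.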

[Similar tricks yield $\omega < 2.496$; see SICOMP 11 (1982), 472-492. The best current
bound, $\omega < 2.3727$, is due to V. Vassilevska Williams, STOC 44 (2012), 887-898.]


68. T. M. Vari has shown that $n - 1$ multiplications are necessary, by proving that
$n$ multiplications are necessary to compute $x_1^2 + \cdots + x_n^2$ [Cornell Computer Science
Report 120 (1972)]. C. Pandu Rangan showed that if we compute the polynomial as
$L_1R_1 + \cdots + L_{n-1}R_{n-1}$, where the $L$'s and $R$'s are linear combinations of the $x$'s,
at least $n - 2$ additions are needed to form the $L$'s and $R$'s [J. Algorithms 4 (1983),
282-285]. But his lower bound does not obviously apply to all polynomial chains.


69. Let $y_{ij} = x_{ij} - [i = j]$, and apply the recursive construction (31) to the matrix
$I + Y$, using arithmetic on power series in the $n^2$ variables $y_{ij}$ but ignoring all terms
of total degree $> n$. Each entry $h$ of the array is represented as a sum $h_0 + h_1 +
\cdots + h_n$, where $h_k$ is the value of a homogeneous polynomial of degree $k$. Then every
addition step becomes $n + 1$ additions, and every multiplication step becomes $\approx \frac12n^2$
multiplications and $\approx \frac12n^2$ additions. Furthermore, every division is by a quantity of
the form $1 + h_1 + \cdots + h_n$, since all divisions in the recursive construction are by 1
when the $y_{ij}$ are entirely zero; therefore division is slightly easier than multiplication
(see Eq. 4.7-(3) when $V_0 = 1$). Since we stop when reaching a $2 \times 2$ determinant,
we need not subtract 1 from $y_{ij}$ when $j > n - 2$. It turns out that when redundant
computations are suppressed, this method requires 20(%) + 8(7) ł 12(5) 4(3) +5n—4 
multiplications and 20(7) +8(7) +4(3) +24(5) —n additions, thus ¢n>—O(n*) of each. 
A similar method can be used to eliminate division in many other cases; see Crelle 264 
(1973), 184-202. (But the next exercise constructs an even faster divisionless scheme 
for determinants.) 


70. Set $A = \lambda - x$, $B = -u$, $C = -v$, and $D = \lambda I - Y$ in the hinted identity, then
take the determinant of both sides, using the fact that $I/\lambda + Y/\lambda^2 + Y^2/\lambda^3 + \cdots$ is
the inverse of $D$ as a formal power series in $1/\lambda$. We need to compute $uY^kv$ only for
$0 \le k \le n - 2$, because we know that $f_X(\lambda)$ is a polynomial of degree $n$; thus, only
$n^3 + O(n^2)$ multiplications and $n^3 + O(n^2)$ additions are needed to advance from degree
$n - 1$ to degree $n$. Proceeding recursively, we obtain the coefficients of $f_X$ from the
elements of $X$ after doing $6\binom{n}{4} + 7\binom{n}{3} + 2\binom{n}{2}$ multiplications and $6\binom{n}{4} + 5\binom{n}{3} + 2\binom{n}{2}$
addition-subtractions.

If we only want to compute $\det X = (-1)^nf_X(0)$, we save $\binom{n}{2} - n + 1$ multiplications
and $\binom{n}{2}$ additions. This division-free method for determinant evaluation is in fact
quite economical when $n$ has a moderate size; it beats the obvious cofactor expansion
scheme when $n > 4$.

If $\omega$ is the exponent of matrix multiplication in exercise 66, the same approach
leads to a division-free computation in $O(n^{\omega+1+\epsilon})$ steps, because the vectors $uY^k$ for
$0 \le k < n$ can be evaluated in $O(M(n)\log n)$ steps: Take a matrix whose first $2^l$
rows are $uY^k$ for $0 \le k < 2^l$ and multiply it by $Y^{2^l}$; then the first $2^l$ rows of the
product are $uY^k$ for $2^l \le k < 2^{l+1}$. [See S. J. Berkowitz, Inf. Processing Letters 18
(1984), 147-150.] Of course such asymptotically “fast” matrix multiplication is strictly
of theoretical interest. E. Kaltofen has shown how to evaluate determinants with only
$O(n^2\sqrt{M(n)})$ additions, subtractions, and multiplications [Proc. Int. Symp. Symb.
Alg. Comp. 17 (1992), 342-349]; his method is interesting even with $M(n) = n^3$.
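The recursion of this answer can be sketched compactly. Writing $X$ with corner entry $x$, row $u$, column $v$, and trailing submatrix $Y$, the code below multiplies $f_Y(\lambda)$ by the truncated series $\lambda - x - \sum_k uY^kv/\lambda^{k+1}$ and keeps the polynomial part. It is a straightforward illustration, assuming a square matrix given as a list of rows; it makes no attempt to suppress redundant operations, so its operation counts are not the ones quoted above.

```python
def char_poly(X):
    """Coefficients of det(lambda*I - X), highest power first, without division."""
    n = len(X)
    coeffs = [1]                         # characteristic polynomial of the empty matrix
    for k in range(n - 1, -1, -1):       # trailing submatrix grows by one row and column
        x = X[k][k]
        u = X[k][k + 1:]
        v = [X[i][k] for i in range(k + 1, n)]
        series = [1, -x]                 # coefficients of lambda^1, lambda^0, lambda^-1, ...
        for _ in range(n - 1 - k):
            series.append(-sum(a * b for a, b in zip(u, v)))   # -u Y^j v
            v = [sum(X[i][j] * v[j - k - 1] for j in range(k + 1, n))
                 for i in range(k + 1, n)]                     # v <- Y v
        coeffs = [sum(coeffs[a] * series[i - a]
                      for a in range(max(0, i - len(series) + 1),
                                     min(i, len(coeffs) - 1) + 1))
                  for i in range(len(coeffs) + 1)]
    return coeffs

def determinant(X):
    """det X = (-1)^n f_X(0)."""
    return (-1) ** len(X) * char_poly(X)[-1]
```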


71. Suppose $g_1 = u_1 \circ v_1$, ..., $g_r = u_r \circ v_r$, and $f = \alpha_1g_1 + \cdots + \alpha_rg_r + p_0$, where
$u_k = \beta_{k1}g_1 + \cdots + \beta_{k(k-1)}g_{k-1} + p_k$, $v_k = \gamma_{k1}g_1 + \cdots + \gamma_{k(k-1)}g_{k-1} + q_k$, each $\circ$ is “$\times$” or
“/”, and each $p_j$ or $q_j$ is a polynomial of degree $\le 1$ in $x_1, \ldots, x_n$. Compute auxiliary
quantities $w_k$, $y_k$, $z_k$ for $k = r$, $r - 1$, ..., 1 as follows: $w_k = \alpha_k + \beta_{(k+1)k}y_{k+1} +
\gamma_{(k+1)k}z_{k+1} + \cdots + \beta_{rk}y_r + \gamma_{rk}z_r$, and

$$\begin{aligned} y_k &= w_k \times v_k, & z_k &= w_k \times u_k, && \text{if } g_k = u_k \times v_k;\\ y_k &= w_k / v_k, & z_k &= -y_k \times g_k, && \text{if } g_k = u_k / v_k.\end{aligned}$$

Then $f' = p_0' + p_1'y_1 + q_1'z_1 + \cdots + p_r'y_r + q_r'z_r$, where $'$ denotes the derivative with
respect to any of $x_1, \ldots, x_n$. [W. Baur and V. Strassen, Theoretical Comp. Sci. 22
(1983), 317-330. A related method had been published by S. Linnainmaa, BIT 16
(1976), 146-160, who applied it to analysis of rounding errors.] We save two chain
multiplications if $g_r = u_r \times v_r$, since $w_r = \alpha_r$. Repeating the construction gives all
second partial derivatives with at most $9m + 3d$ chain multiplications and $4d$ divisions.
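The backward sweep that produces $w_k$, $y_k$, $z_k$ is short enough to write out. The sketch below is one possible encoding, not a prescribed one: each step is a tuple `(op, beta, p, gamma, q)`, where `beta` and `gamma` list the coefficients on earlier $g$'s and the affine forms `p`, `q` are `(constant, coefficients)` pairs. These names and the data layout are assumptions made for illustration.

```python
from fractions import Fraction

def affine(const, coefs, x):
    """Value of const + sum coefs[i] * x[i]."""
    return const + sum(c * xi for c, xi in zip(coefs, x))

def value_and_gradient(steps, alpha, p0, x):
    """Evaluate f and all first partial derivatives with one backward sweep."""
    r, n = len(steps), len(x)
    us, vs, gs = [], [], []
    for op, beta, p, gamma, q in steps:                      # forward sweep
        u = sum(b * g for b, g in zip(beta, gs)) + affine(p[0], p[1], x)
        v = sum(c * g for c, g in zip(gamma, gs)) + affine(q[0], q[1], x)
        us.append(u)
        vs.append(v)
        gs.append(u * v if op == '*' else u / v)
    f = sum(a * g for a, g in zip(alpha, gs)) + affine(p0[0], p0[1], x)
    y, z = [0] * r, [0] * r
    for k in range(r - 1, -1, -1):                           # backward sweep
        w = alpha[k] + sum(steps[j][1][k] * y[j] + steps[j][3][k] * z[j]
                           for j in range(k + 1, r))
        if steps[k][0] == '*':
            y[k], z[k] = w * vs[k], w * us[k]
        else:
            y[k] = w / vs[k]
            z[k] = -y[k] * gs[k]
    grad = [p0[1][i] + sum(steps[k][2][1][i] * y[k] + steps[k][4][1][i] * z[k]
                           for k in range(r))
            for i in range(n)]
    return f, grad

# Example chain: g1 = x1 * x2, g2 = g1 / (x1 + x2), f = g2.
EXAMPLE_STEPS = [('*', [], (0, [1, 0]), [], (0, [0, 1])),
                 ('/', [1], (0, [0, 0]), [0], (0, [1, 1]))]
```

For the example chain at $x = (3, 5)$, evaluated with `Fraction` inputs, the sweep returns the value $15/8$ together with both partial derivatives.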


72. There is an algorithm to compute the tensor rank over algebraically closed fields 
like the complex numbers, since this is a special case of the results of Alfred Tarski, 
A Decision Method for Elementary Algebra and Geometry, 2nd edition (Berkeley, 
California: Univ. of California Press, 1951); but the known methods do not make this 
computation really feasible except for very small tensors. Over the field of rational 
numbers, the problem isn’t even known to be solvable in finite time. 


73. In such a polynomial chain on $N$ variables, the determinant of any $N \times N$ matrix
for $N$ of the linear forms known after $l$ addition-subtraction steps is at most $2^l$. And in
the discrete Fourier transform, the matrix of the final $N = m_1 \ldots m_n$ linear forms has
determinant $N^{N/2}$, since its square is $N$ times a permutation matrix by exercise 13.
[JACM 20 (1973), 305-306.]


74. (a) If $k = (k_1, \ldots, k_s)^T$ is a vector of relatively prime integers, so is $Uk$, since any
common divisor of the elements of $Uk$ divides all elements of $k = U^{-1}Uk$. Therefore
$VUk$ cannot have all integer components.

(b) Suppose there is a polynomial chain for $Vx$ with $t$ multiplications. If $t = 0$, the
entries of $V$ must all be integers, so $s = 0$. Otherwise let $\lambda_i = \alpha \times \lambda_k$ or $\lambda_i = \lambda_j \times \lambda_k$
be the first multiplication step. We can assume that $\lambda_k = n_1x_1 + \cdots + n_sx_s + \beta$ where
$n_1, \ldots, n_s$ are integers, not all zero, and $\beta$ is constant. Find a unimodular matrix
$U$ such that $(n_1, \ldots, n_s)U = (0, \ldots, 0, d)$, where $d = \gcd(n_1, \ldots, n_s)$. (The algorithm
discussed before Eq. 4.5.2-(14) implicitly defines such a $U$.) Construct a new polynomial
chain with inputs $y_1, \ldots, y_{s-1}$ as follows: First calculate $x = (x_1, \ldots, x_s)^T =
U(y_1, \ldots, y_{s-1}, -\beta/d)^T$, then continue with the assumed polynomial chain for $Vx$.
When step $i$ of that chain is reached, we will have $\lambda_k = (n_1, \ldots, n_s)x + \beta = 0$, so
we can simply set $\lambda_i = 0$ instead of multiplying. After $Vx$ has been evaluated, add
the constant vector $w\beta/d$ to the result, where $w$ is the rightmost column of $VU$, and
let $W$ be the other $s - 1$ columns of $VU$. The new polynomial chain has computed
$Vx + w\beta/d = VU(y_1, \ldots, y_{s-1}, -\beta/d)^T + w\beta/d = W(y_1, \ldots, y_{s-1})^T$, with $t - 1$
multiplications. But the columns of $W$ are Z-independent, by part (a); hence $t - 1 \ge s - 1$,
by induction on $s$, and we have $t \ge s$.

(c) Let $x_j = 0$ for the $t - s$ values of $j$ that aren't in the set of Z-independent
columns. Any chain for $Vx$ then evaluates $V'x'$ for a matrix $V'$ to which part (b) applies.

(d) $\lambda_1 = x - y$, $\lambda_2 = \lambda_1 + \lambda_1$, $\lambda_3 = \lambda_2 + x$, $\lambda_4 = (1/6) \times \lambda_3$, $\lambda_5 = \lambda_4 + \lambda_4$,
$\lambda_6 = \lambda_5 + y$ ($= x + y/3$), $\lambda_7 = \lambda_6 - \lambda_1$, $\lambda_8 = \lambda_7 + \lambda_4$ ($= x/2 + y$). But $\{x/2 + y, x + y/2\}$
needs two multiplications, since the columns of $\bigl(\begin{smallmatrix}1/2&1\\1&1/2\end{smallmatrix}\bigr)$ are Z-independent. [Journal
of Information Processing 1 (1978), 125-129.]
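The chain in part (d) can be traced step by step with exact rational arithmetic, which makes it easy to see that a single multiplication yields both $x + y/3$ and $x/2 + y$. This is merely a transcription of the eight steps; the function name is hypothetical.

```python
from fractions import Fraction

def chain(x, y):
    """Evaluate the eight-step chain; returns (lambda_6, lambda_8)."""
    l1 = x - y
    l2 = l1 + l1
    l3 = l2 + x
    l4 = Fraction(1, 6) * l3      # the only multiplication in the chain
    l5 = l4 + l4
    l6 = l5 + y                   # x + y/3
    l7 = l6 - l1
    l8 = l7 + l4                  # x/2 + y
    return l6, l8
```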


SECTION 4.7 


1. Find the first nonzero coefficient $V_m$, as in (4), and divide both $U(z)$ and $V(z)$
by $z^m$ (shifting the coefficients $m$ places to the left). The quotient will be a power
series if and only if $U_0 = \cdots = U_{m-1} = 0$.

2. We have $V_0^{n+1}W_n = V_0^nU_n - (V_0^1W_0)(V_0^{n-1}V_n) - (V_0^2W_1)(V_0^{n-2}V_{n-1}) - \cdots -
(V_0^nW_{n-1})(V_0^0V_1)$. Thus, we can start by replacing $(U_j, V_j)$ by $(V_0^jU_j, V_0^{j-1}V_j)$ for
$j \ge 1$, then set $W_n \leftarrow U_n - \sum_{k=0}^{n-1}W_kV_{n-k}$ for $n \ge 0$, finally replace $W_j$ by $W_j/V_0^{j+1}$
for $j \ge 0$. Similar techniques are possible in connection with other algorithms in this
section.
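The scaling trick of answer 2 can be sketched as follows: all arithmetic in the main loop stays in the integers, and the divisions by powers of $V_0$ are postponed to the very end. The sketch assumes integer coefficient lists of length at least $N + 1$ with $V_0 \ne 0$; the function name is hypothetical.

```python
from fractions import Fraction

def divide_scaled(U, V, N):
    """Coefficients W_0..W_N of U(z)/V(z), dividing only in the final step."""
    v0 = V[0]
    Us = [U[j] * v0 ** j for j in range(N + 1)]                  # V_0^j U_j
    Vs = [0] + [V[j] * v0 ** (j - 1) for j in range(1, N + 1)]   # V_0^(j-1) V_j
    Ws = []
    for n in range(N + 1):
        Ws.append(Us[n] - sum(Ws[k] * Vs[n - k] for k in range(n)))
    return [Fraction(Ws[j], v0 ** (j + 1)) for j in range(N + 1)]
```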

3. Yes. When $\alpha = 0$, it is easy to prove by induction that $W_1 = W_2 = \cdots = 0$. When
$\alpha = 1$, we find $W_n = V_n$, by the cute identity

$$\sum_{k=1}^{n}\Bigl(\frac{2k}{n} - 1\Bigr)V_kV_{n-k} = V_nV_0.$$


4. If $W(z) = e^{V(z)}$, then $W'(z) = V'(z)W(z)$; we find $W_0 = e^{V_0}$, and

$$W_n = \sum_{k=1}^{n}\frac{k}{n}V_kW_{n-k}, \qquad \text{for } n \ge 1.$$

If $W(z) = \ln V(z)$, the roles of $V$ and $W$ are reversed; hence when $V_0 = 1$ the rule is
$W_0 = 0$ and $W_n = V_n + \sum_{k=1}^{n-1}(k/n - 1)V_kW_{n-k}$ for $n \ge 1$.
[By exercise 6, the logarithm can be obtained to order $n$ in $O(n\log n)$ operations.
R. P. Brent observes that $\exp(V(z))$ can also be calculated with this asymptotic speed
by applying Newton's method to $f(x) = \ln x - V(z)$; therefore general exponentiation
$(1 + V(z))^\alpha = \exp(\alpha\ln(1 + V(z)))$ is $O(n\log n)$ too. Reference: Analytic Computational
Complexity, edited by J. F. Traub (New York: Academic Press, 1975), 172-176.]
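Both recurrences are quadratic-time but very short to program. The following sketch uses exact rationals; it assumes $V_0 = 0$ for the exponential (so that $W_0 = 1$) and $V_0 = 1$ for the logarithm, and the function names are hypothetical.

```python
from fractions import Fraction

def series_exp(V, N):
    """Coefficients of exp(V(z)) through z^N, assuming V[0] == 0."""
    W = [Fraction(1)]
    for n in range(1, N + 1):
        W.append(sum(Fraction(k, n) * V[k] * W[n - k] for k in range(1, n + 1)))
    return W

def series_log(V, N):
    """Coefficients of ln V(z) through z^N, assuming V[0] == 1."""
    W = [Fraction(0)]
    for n in range(1, N + 1):
        W.append(V[n] + sum((Fraction(k, n) - 1) * V[k] * W[n - k]
                            for k in range(1, n)))
    return W
```

Applying `series_log` to the output of `series_exp` returns the original coefficients, which gives a convenient self-check.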


5. We get the original series back. This can be used to test a reversion algorithm. 


6. $\phi(x) = x + x(1 - xV(z))$; see Algorithm 4.3.3R. Thus after $W_0$, ..., $W_{N-1}$
are known, the idea is to input $V_N$, ..., $V_{2N-1}$, compute $(W_0 + \cdots + W_{N-1}z^{N-1}) \times
(V_0 + \cdots + V_{2N-1}z^{2N-1}) = 1 + R_0z^N + \cdots + R_{N-1}z^{2N-1} + O(z^{2N})$, and let

$$W_N + \cdots + W_{2N-1}z^{N-1} = -(W_0 + \cdots + W_{N-1}z^{N-1})(R_0 + \cdots + R_{N-1}z^{N-1}) + O(z^N).$$

[Numer. Math. 22 (1974), 341-348; this algorithm was, in essence, first published
by M. Sieveking, Computing 10 (1972), 153-156.] Note that the total time for $N$
coefficients is $O(N\log N)$ arithmetic operations if we use “fast” polynomial multiplication
(exercise 4.6.4-57).
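The doubling step can be sketched directly from this description. In the version below the two truncated products are done by the schoolbook method, so the sketch shows only the structure of the iteration and not the $O(N\log N)$ running time; substituting a fast multiplication routine for `mul` is left as an assumption.

```python
from fractions import Fraction

def mul(a, b, limit):
    """First `limit` coefficients of the product of a(z) and b(z)."""
    out = [0] * limit
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < limit:
                out[i + j] += ai * bj
    return out

def reciprocal(V, N):
    """Coefficients W_0..W_{N-1} of 1/V(z), assuming V[0] != 0."""
    W = [Fraction(1) / V[0]]
    while len(W) < N:
        n = len(W)
        head = list(V[:2 * n]) + [0] * (2 * n - len(V[:2 * n]))
        prod = mul(W, head, 2 * n)        # equals 1 + R_0 z^n + ... + R_{n-1} z^{2n-1}
        R = prod[n:]
        W = W + [-c for c in mul(W, R, n)]
    return W[:N]
```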


7. Wn = (F) /n when n = (m — 1)k +1, otherwise 0. (See exercise 2.3.4.4-11.) 
8. G1. Input $G_1$ and $V_1$; set n ← 1, $U_0 \leftarrow 1/V_1$; output $W_1 = G_1U_0$.
G2. Increase n by 1. Terminate the algorithm if n > N; otherwise input $V_n$
and $G_n$.
G3. Set $U_k \leftarrow (U_k - \sum_{j=1}^{k} U_{k-j}V_{j+1})/V_1$ for k = 0, 1, ..., n − 2 (in this order);
then set $U_{n-1} \leftarrow -\sum_{k=2}^{n} kU_{n-k}V_k/V_1$.
G4. Output $W_n = \sum_{k=1}^{n} kU_{n-k}G_k/n$ and return to G2. ∎
(The running time of the order $N^3$ algorithm is hereby increased by only order $N^2$.)
Note: Algorithms T and N determine $V^{[-1]}(U(z))$; the algorithm in this exercise
determines $G(V^{[-1]}(z))$, which is somewhat different. Of course, the results can all be
obtained by a sequence of operations of reversion and composition (exercise 11), but it
is helpful to have more direct algorithms for each case.
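
Steps G1-G4 translate almost line for line into code. The sketch below is an offline version (all inputs are supplied at once rather than read online), with index 0 of each list unused so that subscripts match the text; the function name `algorithm_g` is illustrative.

```python
from fractions import Fraction

def algorithm_g(G, V, N):
    """Return [W_1, ..., W_N] where W(z) = G(V^{[-1]}(z)).

    G[k] and V[k] hold G_k and V_k for 1 <= k <= N; G[0], V[0] are ignored.
    Requires V[1] != 0.
    """
    U = [Fraction(0)] * N
    U[0] = Fraction(1) / V[1]                      # G1
    W = [G[1] * U[0]]
    for n in range(2, N + 1):                      # G2
        for k in range(0, n - 1):                  # G3, k = 0..n-2 in this order
            U[k] = (U[k] - sum(U[k - j] * V[j + 1]
                               for j in range(1, k + 1))) / V[1]
        U[n - 1] = -sum(k * U[n - k] * V[k] for k in range(2, n + 1)) / V[1]
        W.append(sum(k * U[n - k] * G[k]           # G4
                     for k in range(1, n + 1)) / n)
    return W
```

Taking G(t) = t and V(t) = t − t² reverts the series z = t − t², so the output should be the Catalan numbers; taking G(t) = t² gives the coefficients of the square of that reversion.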


9.
          n=1  n=2  n=3  n=4  n=5
  T_{1n}   1    1    2    5   14
  T_{2n}        1    2    5   14
  T_{3n}             1    3    9
  T_{4n}                  1    4
  T_{5n}                       1

10. Form $y^{1/\alpha} = x(1 + a_1x + a_2x^2 + \cdots)^{1/\alpha} = x(1 + c_1x + c_2x^2 + \cdots)$ by means of
Eq. (9); then revert the latter series. (See the remarks following Eq. 1.2.11.3-(11).)


11. Set $W_0 \leftarrow U_0$, and set $(T_k, W_k) \leftarrow (V_k, 0)$ for 1 ≤ k ≤ N. Then for n = 1,
2, ..., N, do the following: Set $W_j \leftarrow W_j + U_nT_j$ for n ≤ j ≤ N; and then set
$T_j \leftarrow T_{j-1}V_1 + \cdots + T_nV_{j-n}$ for j = N, N − 1, ..., n + 1.

Here T(z) represents $V(z)^n$. An online power series algorithm for this problem,
analogous to Algorithm T, could be constructed, but it would require about $N^2/2$
storage locations. There is also an online algorithm that solves this exercise and needs
only O(N) storage locations: We may assume that $V_1 = 1$, if $U_k$ is replaced by $U_kV_1^k$
and $V_k$ is replaced by $V_k/V_1$ for all k. Then we may revert V(z) by Algorithm L, and
use its output as input to the algorithm of exercise 8 with $G_1 = U_1$, $G_2 = U_2$, etc.,
thus computing $U((V^{[-1]})^{[-1]}(z)) - U_0$. See also exercise 20.
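
The first (offline) method of answer 11 is short enough to spell out. A sketch under the assumption that $V_0 = 0$ and that both inputs are given as lists of length N + 1; the name `compose` is illustrative.

```python
from fractions import Fraction

def compose(U, V, N):
    """Coefficients W_0..W_N of U(V(z)), assuming V[0] == 0."""
    W = [Fraction(0)] * (N + 1)
    W[0] = Fraction(U[0])
    T = [Fraction(v) for v in V]          # T(z) represents V(z)^n, starting at n = 1
    for n in range(1, N + 1):
        for j in range(n, N + 1):
            W[j] += U[n] * T[j]
        # T <- T * V; descending j keeps the old T_i (i < j) available
        for j in range(N, n, -1):
            T[j] = sum(T[i] * V[j - i] for i in range(n, j))
    return W
```

As a check, composing U(z) = 1/(1 − z) with V(z) = z + z² gives 1/(1 − z − z²), whose coefficients are Fibonacci numbers.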

Brent and Kung have constructed several algorithms that are asymptotically faster.
For example, we can evaluate U(x) for x = V(z) by a slight variant of exercise 4.6.4-42(c),
doing about $2\sqrt N$ chain multiplications of cost M(N) and about N parameter
multiplications of cost N, where M(N) is the number of operations needed to multiply
power series to order N; the total time is therefore $O(\sqrt N\,M(N) + N^2) = O(N^2)$.
A still faster method can be based on the identity
$U(V_0(z) + z^mV_1(z)) = U(V_0(z)) + z^mU'(V_0(z))V_1(z) + z^{2m}U''(V_0(z))V_1(z)^2/2! + \cdots$,
extending to about N/m terms, where
we choose $m \approx \sqrt{N/\log N}$; the first term $U(V_0(z))$ is evaluated in $O(mN(\log N)^2)$
operations using a method somewhat like that in exercise 4.6.4-43. Since we can go from
$U^{(k)}(V_0(z))$ to $U^{(k+1)}(V_0(z))$ in O(N log N) operations by differentiating and dividing
by $V_0'(z)$, the entire procedure takes $O(mN(\log N)^2 + (N/m)N\log N) = O(N\log N)^{3/2}$
operations. [JACM 25 (1978), 581-595.]
When the polynomials have m-bit integer coefficients, this algorithm involves
roughly $N^{3/2+\epsilon}$ multiplications of (N lg m)-bit numbers, so the total running time
will be more than $N^{5/2}$. An alternative approach with asymptotic running time
$O(N^{2+\epsilon})$ has been developed by P. Ritzmann [Theoretical Comp. Sci. 44 (1986), 1-16].
Composition can be done much faster modulo a small prime p (see exercise 26).


12. Polynomial division is trivial unless m ≥ n ≥ 1. Assuming the latter, the equation
u(x) = q(x)v(x) + r(x) is equivalent to $U(z) = Q(z)V(z) + z^{m-n+1}R(z)$ where
$U(x) = x^mu(x^{-1})$, $V(x) = x^nv(x^{-1})$, $Q(x) = x^{m-n}q(x^{-1})$, and $R(x) = x^{n-1}r(x^{-1})$ are the
"reverse" polynomials of u, v, q, and r.

To find q(x) and r(x), compute the first m − n + 1 coefficients of the power series
$U(z)/V(z) = W(z) + O(z^{m-n+1})$; then compute the power series U(z) − V(z)W(z),
which has the form $z^{m-n+1}T(z)$ where $T(z) = T_0 + T_1z + \cdots$. Note that $T_j = 0$ for all
j ≥ n; hence Q(z) = W(z) and R(z) = T(z) satisfy the requirements.
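
The reversal trick of answer 12 can be sketched as follows. Coefficient lists are stored lowest degree first, the power series quotient is obtained by the simple quadratic recurrence rather than a fast method, and the name `poly_divmod` is an assumption of this example. The divisor is assumed to have degree n ≥ 1 and a nonzero leading coefficient.

```python
from fractions import Fraction

def poly_divmod(u, v):
    """Return (q, r) with u = q*v + r and deg r < deg v, via reversed power series."""
    m, n = len(u) - 1, len(v) - 1
    if m < n:
        return [Fraction(0)], [Fraction(c) for c in u]
    U, V = u[::-1], v[::-1]              # reverse polynomials U(z), V(z)
    L = m - n + 1
    W = []                               # W = U/V mod z^L
    for k in range(L):
        s = sum(W[i] * V[k - i] for i in range(k) if k - i <= n)
        W.append((Fraction(U[k]) - s) / V[0])
    T = []                               # U - V*W = z^L * T(z), with T_0..T_{n-1}
    for k in range(L, m + 1):
        s = sum(W[i] * V[k - i] for i in range(L) if 0 <= k - i <= n)
        T.append(Fraction(U[k]) - s)
    return W[::-1], T[::-1]              # undo the reversal: Q -> q, R -> r
```

For example, dividing x³ − 2x² + 4 by x − 3 should give quotient x² + x + 3 and remainder 13.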


13. Apply exercise 4.6.1-3 with $u(z) = z^N$ and $v(z) = W_0 + \cdots + W_{N-1}z^{N-1}$; the
desired approximations are the values of v3(z)/v2(z) obtained during the course of 
the algorithm. Exercise 4.6.1—26 tells us that there are no further possibilities with 
relatively prime numerator and denominator. If each W; is an integer, an all-integer 
extension of Algorithm 4.6.1C will have the desired properties. 

Notes: See the book History of Continued Fractions and Padé Approximants by 
Claude Brezinski (Berlin: Springer, 1991) for further information. The case N = 2n+1 
and deg(w1) = deg(w2) = n is of particular interest, since it is equivalent to a so-called 
Toeplitz system; asymptotically fast methods for Toeplitz systems are surveyed in Bini 
and Pan, Polynomial and Matrix Computations 1 (Boston: Birkhäuser, 1994), §2.5.
The method of this exercise can be generalized to arbitrary rational interpolation of
the form $W(z) \equiv p(z)/q(z)$ (modulo $(z - z_1)\ldots(z - z_N)$), where the $z_j$'s need not be
distinct; thus, we can specify the value of W(z) and some of its derivatives at several 
points. See Richard P. Brent, Fred G. Gustavson, and David Y. Y. Yun, J. Algorithms 
1 (1980), 259-295. 


14. If $U(z) = z + U_kz^k + \cdots$ and $V(z) = z^k + V_{k+1}z^{k+1} + \cdots$, we find that the difference
$V(U(z)) - U'(z)V(z)$ is $\sum_{j\ge1} z^{2k+j-1}j\,(U_kV_{k+j} - U_{k+j} + (\text{polynomial involving only } U_k, \ldots, U_{k+j-1}, V_{k+1}, \ldots, V_{k+j-1}))$;
hence V(z) is unique if U(z) is given and U(z)
is unique if V(z) and $U_k$ are given.

The solution depends on two auxiliary algorithms, the first of which solves the
equation $V(z + z^kU(z)) = (1 + z^{k-1}W(z))V(z) + z^{k-1}S(z) + O(z^{k-1+n})$ for
$V(z) = V_0 + V_1z + \cdots + V_{n-1}z^{n-1}$, given U(z), W(z), S(z), and n. If n = 1, let $V_0 = -S(0)/W(0)$;
or let $V_0$ be arbitrary when S(0) = W(0) = 0. To go from n to 2n, let

$$V(z + z^kU(z)) = (1 + z^{k-1}W(z))V(z) + z^{k-1}S(z) - z^{k-1+n}R(z) + O(z^{k-1+2n}),$$
$$1 + z^{k-1}\hat W(z) = (z/(z + z^kU(z)))^n(1 + z^{k-1}W(z)) + O(z^{k-1+n}),$$
$$\hat S(z) = (z/(z + z^kU(z)))^nR(z) + O(z^n),$$

and let $\hat V(z) = V_n + V_{n+1}z + \cdots + V_{2n-1}z^{n-1}$ satisfy

$$\hat V(z + z^kU(z)) = (1 + z^{k-1}\hat W(z))\hat V(z) + z^{k-1}\hat S(z) + O(z^{k-1+n}).$$

The second algorithm solves $W(z)U(z) + zU'(z) = V(z) + O(z^n)$ for
$U(z) = U_0 + U_1z + \cdots + U_{n-1}z^{n-1}$, given V(z), W(z), and n. If n = 1, let $U_0 = V(0)/W(0)$, or let
$U_0$ be arbitrary in case V(0) = W(0) = 0. To go from n to 2n, let
$W(z)U(z) + zU'(z) = V(z) - z^nR(z) + O(z^{2n})$, and let $\hat U(z) = U_n + \cdots + U_{2n-1}z^{n-1}$ be a solution to the
equation $(n + W(z))\hat U(z) + z\hat U'(z) = R(z) + O(z^n)$.

Resuming the notation of (27), the first algorithm can be used to solve
$\hat V(U(z)) = U'(z)(z/U(z))^k\hat V(z)$ to any desired accuracy, and we set $V(z) = z^k\hat V(z)$. To find
P(z), suppose we have $V(P(z)) = P'(z)V(z) + O(z^{2k-1+n})$, an equation that holds
for n = 1 when $P(z) = z + \alpha z^k$ and α is arbitrary. We can go from n to 2n by
letting $V(P(z)) = P'(z)V(z) + z^{2k-1+n}R(z) + O(z^{2k-1+2n})$ and replacing P(z) by
$P(z) + z^{k+n}\hat P(z)$, where the second algorithm is used to find the polynomial $\hat P(z)$ such
that $(k + n - zV'(P(z))/V(z))\hat P(z) + z\hat P'(z) = (z^k/V(z))R(z) + O(z^n)$.

15. The differential equation $U'(z)/U(z)^k = 1/z^k$ implies that $U(z)^{1-k} = z^{1-k} + c$ for
some constant c. So we find $U^{[n]}(z) = z/(1 + cnz^{k-1})^{1/(k-1)}$.

A similar argument solves (27) for arbitrary V(z): If $W'(z) = 1/V(z)$, we have
$W(U^{[n]}(z)) = W(z) + nc$ for some c.
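
The closed form for the iterates in answer 15 is easy to confirm numerically. The following is a small floating-point sketch; the sample values of k, c, z, and n used in the check are arbitrary, and the function names are illustrative.

```python
import math

def u_step(z, k, c):
    """One application of U, defined by U(z)^(1-k) = z^(1-k) + c (z > 0, k > 1)."""
    return (z ** (1 - k) + c) ** (1.0 / (1 - k))

def u_iterate(z, k, c, n):
    """Apply U to z a total of n times."""
    for _ in range(n):
        z = u_step(z, k, c)
    return z

def u_closed_form(z, k, c, n):
    """The claimed n-th iterate z / (1 + c n z^(k-1))^(1/(k-1))."""
    return z / (1 + c * n * z ** (k - 1)) ** (1.0 / (k - 1))
```

Each application of U adds c to $z^{1-k}$, which is why n applications add nc.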

16. We want to show that $[t^n]\,t^{n+1}\bigl((n+1)R_{k+1}'(t)/V(t)^n - nR_k'(t)/V(t)^{n+1}\bigr) = 0$. This
follows since $(n+1)R_{k+1}'(t)/V(t)^n - nR_k'(t)/V(t)^{n+1} = \frac{d}{dt}\bigl(R_k(t)/V(t)^{n+1}\bigr)$. Consequently
we have $n^{-1}[t^{n-1}]\,R_1'(t)\,t^n/V(t)^n = (n-1)^{-1}[t^{n-2}]\,R_2'(t)\,t^{n-1}/V(t)^{n-1} = \cdots =
1^{-1}[t^0]\,R_n'(t)\,t/V(t) = [t]\,R_n(t)/V_1 = W_n$.

17. Equating coefficients of $x^ly^m$, the convolution formula states that
$\binom{l+m}{l}v_{n(l+m)} = \sum_k\binom{n}{k}v_{kl}v_{(n-k)m}$, which is the same as
$[z^n]\,V(z)^{l+m} = \sum_k([z^k]\,V(z)^l)([z^{n-k}]\,V(z)^m)$,
which is a special case of (2).

Notes: The name “poweroid” was introduced by J. F. Steffensen, who was the 
first of many authors to study the striking properties of these polynomials in general 
[Acta Mathematica 73 (1941), 333-366]. For a review of the literature, and for further 
discussion of the topics in the next several exercises, see D. E. Knuth, The Mathematica 
Journal 2 (1992), 67-78. One of the results proved in that paper is the asymptotic 
formula $V_n(x) = e^{xV(s)}\bigl(\frac{n}{es}\bigr)^n\bigl(1 - V_2y + O(y^2) + O(x^{-1})\bigr)$, if $V_1 = 1$ and $sV'(s) = y$ and
y = n/x is bounded as x → ∞ and n → ∞.

18. We have $V_n(x) = \sum_k x^kn!\,[z^n]\,V(z)^k/k! = n!\,[z^n]\,e^{xV(z)}$. Consequently $V_n(x)/x =
(n-1)!\,[z^{n-1}]\,V'(z)\,e^{xV(z)}$ when n > 0. We get the stated identity by equating the
coefficients of $z^{n-1}$ in $V'(z)\,e^{(x+y)V(z)} = V'(z)\,e^{xV(z)}e^{yV(z)}$.


19. We have

$$v_{nk} = \frac{n!}{k!}\,[z^n]\Bigl(\frac{v_1}{1!}z + \frac{v_2}{2!}z^2 + \frac{v_3}{3!}z^3 + \cdots\Bigr)^k
= \sum_{\substack{k_1+k_2+\cdots+k_n=k\\ k_1+2k_2+\cdots+nk_n=n\\ k_1,k_2,\ldots,k_n\ge0}}
\frac{n!}{k_1!\,k_2!\ldots k_n!}\Bigl(\frac{v_1}{1!}\Bigr)^{k_1}\Bigl(\frac{v_2}{2!}\Bigr)^{k_2}\cdots\Bigl(\frac{v_n}{n!}\Bigr)^{k_n}$$

by the multinomial theorem 1.2.6-(42). These coefficients, called partial Bell polynomials
[see Annals of Math. (2) 35 (1934), 258-277], arise also in Arbogast's formula,
exercise 1.2.5-21, and we can associate the terms with set partitions as explained in
the answer to that exercise. The recurrence

$$v_{nk} = \sum_j\binom{n-1}{j-1}v_j\,v_{(n-j)(k-1)}$$

shows how to calculate column k from columns 1 and k − 1; it is readily interpreted with
respect to partitions of {1, ..., n}, since there are $\binom{n-1}{j-1}$ ways to include the element n
in a subset of size j. The first few rows of the matrix are

  v_1
  v_2   v_1^2
  v_3   3v_1v_2              v_1^3
  v_4   4v_1v_3 + 3v_2^2     6v_1^2v_2                   v_1^4
  v_5   5v_1v_4 + 10v_2v_3   15v_1v_2^2 + 10v_1^2v_3     10v_1^3v_2   v_1^5
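
The column recurrence of answer 19 gives a direct way to tabulate a power matrix numerically. The sketch below assumes the convention $v_{00} = 1$ to start the recurrence (an assumption of this example, not stated in the text), stores $v_j$ in `v[j]` with `v[0]` unused, and needs Python 3.8 or later for `math.comb`.

```python
from math import comb

def power_matrix(v, N):
    """Return M with M[n][k] = v_{nk} for 1 <= k <= n <= N."""
    M = [[0] * (N + 1) for _ in range(N + 1)]
    M[0][0] = 1                          # assumed starting convention
    for n in range(1, N + 1):
        for k in range(1, n + 1):
            # v_{nk} = sum_j C(n-1, j-1) v_j v_{(n-j)(k-1)}
            M[n][k] = sum(comb(n - 1, j - 1) * v[j] * M[n - j][k - 1]
                          for j in range(1, n - k + 2))
    return M
```

Setting every $v_j = 1$ counts set partitions, so the entries become Stirling numbers of the second kind; for example row 5 is 1, 15, 25, 10, 1, in agreement with the coefficient sums of the last displayed row.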


20. $[z^n]\,W(z)^k = \sum_j([z^j]\,U(z)^k)([z^n]\,V(z)^j)$; hence
$w_{nk} = (n!/k!)\sum_j((k!/j!)u_{jk})((j!/n!)v_{nj})$.
[E. Jabotinsky, Comptes Rendus Acad. Sci. 224 (Paris, 1947), 323-324.]
21. (a) If U(z) = αW(βz) we have $u_{nk} = \frac{n!}{k!}[z^n]\,(\alpha W(\beta z))^k = \alpha^k\beta^nw_{nk}$; in particular,
if $U(z) = V^{[-1]}(z) = -W(-z)$ we have $u_{nk} = (-1)^{n-k}w_{nk}$. So $\sum_k u_{nk}v_{km}$ and
$\sum_k v_{nk}u_{km}$ correspond to the identity function z, by exercise 20.

(b) [Solution by Ira Gessel.] This identity is, in fact, equivalent to Lagrange's
inversion formula: We have $w_{nk} = (-1)^{n-k}u_{nk} = (-1)^{n-k}\frac{n!}{k!}[z^n]\,V^{[-1]}(z)^k$, and the
coefficient of $z^n$ in $V^{[-1]}(z)^k$ is $n^{-1}[t^{n-1}]\,kt^{n+k-1}/V(t)^n$ by exercise 16. On the
other hand we have defined $v_{(-k)(-n)}$ to be $(-k)^{\underline{n-k}}\,[z^{n-k}]\,(V(z)/z)^{-n}$, which equals
$(-1)^{n-k}(n-1)\ldots(k+1)k\,[z^{n-1}]\,z^{n+k-1}/V(z)^n$.

22. (a) If $V(z) = U^{\{\alpha\}}(z)$ and $W(z) = V^{\{\beta\}}(z)$, we have $W(z) = V(zW(z)^\beta) =
U(zW(z)^\beta V(zW(z)^\beta)^\alpha) = U(zW(z)^{\alpha+\beta})$. (Notice the contrast between this law and
the similar formulas $U^{[1]}(z) = U(z)$, $U^{[\alpha][\beta]}(z) = U^{[\alpha\beta]}(z)$ that apply to iteration.)

(b) $B^{\{2\}}(z)$ is the generating function for binary trees, 2.3.4.4-(12), which is
W(z)/z in the example z = t − t² following Algorithm L. Moreover, $B^{\{t\}}(z)$ is the
generating function for t-ary trees, exercise 2.3.4.4-11.

(c) The hint is equivalent to $zU^{\{\alpha\}}(z)^\alpha = W^{[-1]}(z)$, which is equivalent to the
formula $zU^{\{\alpha\}}(z)^\alpha/U(zU^{\{\alpha\}}(z)^\alpha)^\alpha = z$. Now Lagrange's inversion theorem (exercise 8)
says that $[z^n]\,W^{[-1]}(z)^x = \frac{x}{n}[z^{-x}]\,W(z)^{-n}$ when x is a positive integer. (Here $W(z)^{-n}$
is a Laurent series, that is, a power series divided by a power of z; we can use the notation
$[z^m]\,V(z)$ for Laurent series as well as for power series.) Therefore $[z^n]\,U^{\{\alpha\}}(z)^x =
[z^n]\,(W^{[-1]}(z)/z)^{x/\alpha} = [z^{n+x/\alpha}]\,W^{[-1]}(z)^{x/\alpha}$ is equal to
$\frac{x/\alpha}{n+x/\alpha}[z^{-x/\alpha}]\,W(z)^{-n-x/\alpha} =
\frac{x}{x+n\alpha}[z^{-x/\alpha}]\,z^{-n-x/\alpha}U(z)^{x+n\alpha}$ when x/α is a positive integer. We have verified the
result for infinitely many α; that is sufficient, since the coefficients of $U^{\{\alpha\}}(z)^x$ are
polynomials in α.

We've seen special cases of this result in exercises 1.2.6-25 and 2.3.4.4-29. One
memorable consequence of the hint is the case α = −1:

$$W(z) = zU(z) \quad\text{if and only if}\quad W^{[-1]}(z) = z/U^{\{-1\}}(z).$$

(d) If $U_0 = 1$ and $V_n(x)$ is the poweroid for V(z) = ln U(z), we've just proved that
$xV_n(x + n\alpha)/(x + n\alpha)$ is the poweroid for $\ln U^{\{\alpha\}}(z)$. So we can plug this poweroid
into the former identities, changing y to y − αn in the second formula.
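
Induced functions can be explored numerically by iterating the defining relation $F(z) = U(zF(z)^\alpha)$ until the first N coefficients stabilize. This is only a sketch for nonnegative integer α, with hypothetical names `mul_trunc` and `induced`; each pass fixes at least one more coefficient, so N passes are enough.

```python
from fractions import Fraction

def mul_trunc(a, b, n):
    """Product of two coefficient lists, truncated to n terms."""
    out = [Fraction(0)] * n
    for i, ai in enumerate(a[:n]):
        if ai == 0:
            continue
        for j, bj in enumerate(b[:n - i]):
            out[i + j] += ai * bj
    return out

def induced(U, alpha, N):
    """First N coefficients of F with F(z) = U(z F(z)^alpha), U a polynomial."""
    F = [Fraction(U[0])] + [Fraction(0)] * (N - 1)
    for _pass in range(N):
        y = [Fraction(1)] + [Fraction(0)] * (N - 1)
        for _power in range(alpha):
            y = mul_trunc(y, F, N)
        y = [Fraction(0)] + y[:N - 1]            # y = z F^alpha mod z^N
        new = [Fraction(0)] * N
        for c in reversed(U):                    # Horner evaluation of U(y)
            new = mul_trunc(new, y, N)
            new[0] += c
        F = new
    return F
```

With U(z) = 1 + z (an assumed choice for the series called B in part (b)), α = 2 gives the binary-tree numbers 1, 1, 2, 5, 14, 42 and α = 3 the ternary-tree numbers 1, 1, 3, 12, 55, consistent with the formula of part (c) for x = 1.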

23. (a) We have U = I + T where $T^n$ is zero in rows ≤ n. Hence $\ln U = T - \frac12T^2 +
\frac13T^3 - \cdots$ will have the property that $\exp(\alpha\ln U) = I + \binom{\alpha}{1}T + \binom{\alpha}{2}T^2 + \cdots = U^\alpha$. Each
entry of $U^\alpha$ is a polynomial in α, and the relations of exercise 19 hold whenever α is a
positive integer; therefore $U^\alpha$ is a power matrix for all α, and its first column defines
$U^{[\alpha]}(z)$. (In particular, $U^{-1}$ is a power matrix; this is another way to revert U(z).)

(b) Since $U^\epsilon = I + \epsilon\ln U + O(\epsilon^2)$, we have

$$l_{nk} = [\epsilon]\,u^{[\epsilon]}_{nk} = \frac{n!}{k!}\,[z^n]\,[\epsilon]\,(z + \epsilon L(z) + O(\epsilon^2))^k = \frac{n!}{k!}\,[z^n]\,kz^{k-1}L(z).$$

(c) $\frac{\partial}{\partial\alpha}U^{[\alpha]}(z) = [\epsilon]\,U^{[\alpha+\epsilon]}(z)$, and we have

$$U^{[\alpha+\epsilon]}(z) = U^{[\alpha]}(U^{[\epsilon]}(z)) = U^{[\alpha]}(z + \epsilon L(z) + O(\epsilon^2)).$$

Also $U^{[\epsilon+\alpha]}(z) = U^{[\epsilon]}(U^{[\alpha]}(z)) = U^{[\alpha]}(z) + \epsilon L(U^{[\alpha]}(z)) + O(\epsilon^2)$.

(d) The identity follows from the fact that U commutes with ln U. It determines
$l_{n-1}$ when n ≥ 4, because the coefficient of $l_{n-1}$ on the left is $nu_2$, while the coefficient
on the right is $u_{n(n-1)} = \binom{n}{2}u_2$. Similarly, if $u_2 = \cdots = u_{k-1} = 0$ and $u_k \ne 0$, we have
$l_k = u_k$ and the recurrence for n ≥ 2k determines $l_{k+1}, l_{k+2}, \ldots$: The left side has the
form $l_n + \binom{n}{k-1}l_{n+1-k}u_k + \cdots$ and the right side has the form $l_n + \binom{n}{k}l_{n+1-k}u_k + \cdots$.
In general, $l_2 = u_2$, $l_3 = u_3 - \frac32u_2^2$, $l_4 = u_4 - 5u_2u_3 + \frac92u_2^3$, $l_5 = u_5 - \frac{15}{2}u_2u_4 - 5u_3^2 +
\frac{185}{6}u_2^2u_3 - 20u_2^4$.

(e) We have $U = \sum_m(\ln U)^m/m!$, and for fixed m the contribution to $u_n = u_{n1}$
from the mth term is $\sum l_{n_mn_{m-1}}\ldots l_{n_2n_1}l_{n_1n_0}$ summed over $n = n_m > \cdots > n_1 >
n_0 = 1$. Now apply the result of part (b). [See Trans. Amer. Math. Soc. 108 (1963),
457-477.]
24. (a) By (21) and exercise 20, we have $U = VDV^{-1}$ where V is the power matrix
of the Schröder function and D is the diagonal matrix $\mathrm{diag}(u, u^2, u^3, \ldots)$. So we may
take $\ln U = V\,\mathrm{diag}(\ln u, 2\ln u, 3\ln u, \ldots)\,V^{-1}$. (b) The equation $WVDV^{-1} = VDV^{-1}W$
implies $(V^{-1}WV)D = D(V^{-1}WV)$. The diagonal entries of D are distinct, so $V^{-1}WV$
must be a diagonal matrix D′. Thus $W = VD'V^{-1}$, and W has the same Schröder
function as U. It follows that $W_1 \ne 0$ and $W = VD^\alpha V^{-1}$, where $\alpha = (\ln W_1)/(\ln U_1)$.


25. We must have k = l because $[z^{k+l-1}]\,U(V(z)) = U_{k+l-1} + V_{k+l-1} + kU_kV_l$. To
complete the proof it suffices to show that $U_k = V_k$ and U(V(z)) = V(U(z)) implies
U(z) = V(z). Suppose l is minimal with $U_l \ne V_l$, and let n = k + l − 1. Then we
have $u_{nk} - v_{nk} = \binom{n}{l}(u_l - v_l)$; $u_{nj} = v_{nj}$ for all j > k; $u_{nl} = \binom{n}{k}u_k$; and $u_{nj} = 0$ for
l < j < n. Now the sum $\sum_j u_{nj}v_j = u_n + u_{nk}v_k + \cdots + u_{nl}v_l + v_n$ must be equal to
$\sum_j v_{nj}u_j$; so we find $\binom{n}{l}(u_l - v_l)u_k = \binom{n}{k}v_k(u_l - v_l)$. But we have $\binom{n}{l} = \binom{n}{k}$
if and only if k = l.

[From this exercise and the previous one, we might suspect that U(V(z)) =
V(U(z)) only when one of U and V is an iterate of the other. But this is not necessarily
true when $U_1$ and $V_1$ are roots of unity. For example, if $V_1 = -1$ and $U(z) = V^{[2]}(z)$,
V is not an iterate of $U^{[1/2]}$, nor is $U^{[1/2]}$ an iterate of V.]


26. Writing $U(z) = U_{[0]}(z^2) + zU_{[1]}(z^2)$, we have $U(V(z)) = U_{[0]}(V_1z^2 + V_2z^4 + \cdots) +
V(z)\,U_{[1]}(V_1z^2 + V_2z^4 + \cdots)$ (modulo 2). The running time satisfies T(N) = 2T(N/2) +
C(N), where C(N) is essentially the time for polynomial multiplication mod $z^N$. We
can make $C(N) = O(N^{1+\epsilon})$ by the method of, say, exercise 4.6.4-59; see also the answer
to exercise 4.6-5.

A similar method works mod p in time $O(pN^{1+\epsilon})$. [D. J. Bernstein, J. Symbolic
Computation 26 (1998), 339-341.]
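
The even/odd splitting of answer 26 can be illustrated with a small recursive routine over GF(2). This sketch uses naive multiplication, so it shows the structure of the recursion rather than the stated running time; it relies on $V(z)^2 = V(z^2)$ modulo 2 and assumes $V_0 = 0$. All function names are illustrative, and a Horner-style routine is included only for cross-checking.

```python
def mul_mod2(a, b, n):
    """Product of bit lists a and b over GF(2), truncated to n terms."""
    out = [0] * n
    for i, ai in enumerate(a[:n]):
        if ai:
            for j, bj in enumerate(b[:n - i]):
                out[i + j] ^= bj
    return out

def compose_mod2(U, V, N):
    """U(V(z)) mod z^N over GF(2), by splitting U into even and odd parts."""
    if N == 1:
        return [U[0] & 1] if U else [0]
    half = (N + 1) // 2
    A = compose_mod2(U[0::2], V, half)    # (U_[0] o V)(y) mod y^half
    B = compose_mod2(U[1::2], V, half)    # (U_[1] o V)(y) mod y^half

    def spread(P):                        # substitute y = z^2, truncate to z^N
        out = [0] * N
        for i, p in enumerate(P):
            if 2 * i < N:
                out[2 * i] = p
        return out

    VB = mul_mod2(V, spread(B), N)
    return [a ^ b for a, b in zip(spread(A), VB)]

def compose_naive_mod2(U, V, N):
    """Reference version: Horner's rule with truncated products."""
    res = [0] * N
    for c in reversed(U):
        res = mul_mod2(res, V, N)
        res[0] ^= c
    return res
```

Both routines should agree on any input with $V_0 = 0$, which is a convenient way to test the recursion.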


27. From $(W(qz) - W(z))V(z) = W(z)(V(q^mz) - V(z))$ we obtain the recurrence
$W_n = \sum_{k=1}^{n}V_kW_{n-k}(q^{km} - q^{n-k})/(q^n - 1)$. [J. Difference Eqs. and Applics. 1 (1995),
57-60.]
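
The functional equation in answer 27 is satisfied by the product $W(z) = V(z)V(qz)\ldots V(q^{m-1}z)$, so the recurrence can be checked against a directly expanded product. The sketch below assumes $V_0 = 1$, $W_0 = 1$, and a value of q that is not a root of unity; the name `q_power` is illustrative.

```python
from fractions import Fraction

def q_power(V, q, m, N):
    """W_0..W_{N-1} for W(z) = V(z) V(qz) ... V(q^(m-1) z), with V[0] == 1."""
    W = [Fraction(1)]
    for n in range(1, N):
        s = sum(V[k] * W[n - k] * (q ** (k * m) - q ** (n - k))
                for k in range(1, n + 1) if k < len(V))
        W.append(Fraction(s) / (q ** n - 1))
    return W
```

For V(z) = 1 + z, q = 2, and m = 3 the product is (1 + z)(1 + 2z)(1 + 4z) = 1 + 7z + 14z² + 8z³.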


28. Note first that δ(U(z)V(z)) = (δU(z))V(z) + U(z)(δV(z)), because t(mn) =
t(m) + t(n). Therefore $\delta(V(z)^n) = nV(z)^{n-1}\delta V(z)$ for all n ≥ 0, by induction
on n; and this is the identity we need to show that $\delta e^{V(z)} = \sum_{n\ge0}\delta(V(z)^n/n!) =
e^{V(z)}\delta V(z)$. Replacing V(z) by ln V(z) in this equation gives V(z) δ ln V(z) = δV(z);
hence $\delta(V(z)^\alpha) = \delta e^{\alpha\ln V(z)} = e^{\alpha\ln V(z)}\,\delta(\alpha\ln V(z)) = \alpha V(z)^{\alpha-1}\delta V(z)$ for all complex
numbers α.
It follows that the desired recurrences are
(a) $W_1 = 1$, $W_n = \sum_{d\backslash n,\,d>1}\bigl((\alpha+1)t(d)/t(n) - 1\bigr)V_dW_{n/d}$;
(b) $W_1 = 1$, $W_n = \sum_{d\backslash n,\,d>1}\bigl(t(d)/t(n)\bigr)V_dW_{n/d}$;
(c) $W_1 = 0$, $W_n = V_n + \sum_{d\backslash n,\,d>1}\bigl(t(d)/t(n) - 1\bigr)V_dW_{n/d}$.

[See H. W. Gould, AMM 81 (1974), 3-14. These formulas hold when t is any function
such that t(m) + t(n) = t(mn) and t(n) = 0 if and only if n = 1, but the suggested t
is simplest. The method discussed here works also for power series in arbitrarily many
variables; then t is the total degree of a term.]
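
Recurrence (c) can be tried on the zeta function, whose Dirichlet-series logarithm is known to have coefficient 1/k at n = p^k and 0 elsewhere. The sketch below takes t(n) to be the number of prime factors of n counted with multiplicity, which is one function with the required properties (whether it is the t suggested in the exercise is an assumption here); the function names are illustrative.

```python
from fractions import Fraction

def big_omega(n):
    """Number of prime factors of n, counted with multiplicity."""
    count, p = 0, 2
    while p * p <= n:
        while n % p == 0:
            n //= p
            count += 1
        p += 1
    return count + (1 if n > 1 else 0)

def dirichlet_log(V, N):
    """Return W with W[n] the coefficient of n^(-s) in ln V, for 1 <= n <= N.

    V[n] is the coefficient of n^(-s) in V, with V[1] == 1; index 0 is unused.
    """
    t = big_omega
    W = [None, Fraction(0)]
    for n in range(2, N + 1):
        s = sum((Fraction(t(d), t(n)) - 1) * V[d] * W[n // d]
                for d in range(2, n + 1) if n % d == 0)
        W.append(V[n] + s)
    return W
```

With all V[n] = 1 the output has W[2] = W[3] = 1, W[4] = W[9] = 1/2, W[8] = 1/3, and zeros at numbers such as 6 and 12 that are not prime powers.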


“It is certainly an idea you have there,” said Poirot, with some interest. 
“Yes, yes, I play the part of the computer.
One feeds in the information — ” 


“And supposing you come up with all the wrong answers?” said Mrs. Oliver. 


“That would be impossible,” said Hercule Poirot. 
“Computers do not do that sort of a thing.” 


“They're not supposed to,” said Mrs. Oliver, 
“but you'd be surprised at the things that happen sometimes.”


— AGATHA CHRISTIE, Hallowe’en Party (1969) 


APPENDIX A 


TABLES OF NUMERICAL QUANTITIES 


Table 1 


QUANTITIES THAT ARE FREQUENTLY USED IN STANDARD SUBROUTINES 
AND IN ANALYSIS OF COMPUTER PROGRAMS (40 DECIMAL PLACES) 


          √2 = 1.41421 35623 73095 04880 16887 24209 69807 85697−
          √3 = 1.73205 08075 68877 29352 74463 41505 87236 69428+
          √5 = 2.23606 79774 99789 69640 91736 68731 27623 54406+
         √10 = 3.16227 76601 68379 33199 88935 44432 71853 37196−
          ∛2 = 1.25992 10498 94873 16476 72106 07278 22835 05703−
          ∛3 = 1.44224 95703 07408 38232 16383 10780 10958 83919−
         ⁴√2 = 1.18920 71150 02721 06671 74999 70560 47591 52930−
        ln 2 = 0.69314 71805 59945 30941 72321 21458 17656 80755+
        ln 3 = 1.09861 22886 68109 69139 52452 36922 52570 46475−
       ln 10 = 2.30258 50929 94045 68401 79914 54684 36420 76011+
      1/ln 2 = 1.44269 50408 88963 40735 99246 81001 89213 74266+
     1/ln 10 = 0.43429 44819 03251 82765 11289 18916 60508 22944−
           π = 3.14159 26535 89793 23846 26433 83279 50288 41972−
  1° = π/180 = 0.01745 32925 19943 29576 92369 07684 88612 71344+
         1/π = 0.31830 98861 83790 67153 77675 26745 02872 40689+
          π² = 9.86960 44010 89358 61883 44909 99876 15113 53137−
 √π = Γ(1/2) = 1.77245 38509 05516 02729 81674 83341 14518 27975+
      Γ(1/3) = 2.67893 85347 07747 63365 56929 40974 67764 41287−
      Γ(2/3) = 1.35411 79394 26400 41694 52880 28154 51378 55193+
           e = 2.71828 18284 59045 23536 02874 71352 66249 77572+
         1/e = 0.36787 94411 71442 32159 55237 70161 46086 74458+
          e² = 7.38905 60989 30650 22723 04274 60575 00781 31803+
           γ = 0.57721 56649 01532 86060 65120 90082 40243 10422−
        ln π = 1.14472 98858 49400 17414 34273 51353 05871 16473−
           φ = 1.61803 39887 49894 84820 45868 34365 63811 77203+
         e^γ = 1.78107 24179 90197 98523 65041 03107 17954 91696+
     e^(π/4) = 2.19328 00507 38015 45655 97696 59278 73822 34616−
       sin 1 = 0.84147 09848 07896 50665 25023 21630 29899 96226−
       cos 1 = 0.54030 23058 68139 71740 09366 07442 97660 37323+
      −ζ′(2) = 0.93754 82543 15843 75370 25740 94567 86497 78979−
        ζ(3) = 1.20205 69031 59594 28539 97381 61511 44999 07650−
        ln φ = 0.48121 18250 59603 44749 77589 13424 36842 31352−
      1/ln φ = 2.07808 69212 35027 53760 13226 06117 79576 77422−
    −ln ln 2 = 0.36651 29205 81664 32701 24391 58232 66946 94543−



Table 2 


QUANTITIES THAT ARE FREQUENTLY USED IN STANDARD SUBROUTINES 
AND IN ANALYSIS OF COMPUTER PROGRAMS (45 OCTAL PLACES) 


The names at the left of the “=” signs are given in decimal notation. 


          0.1 = 0.06314 63146 31463 14631 46314 63146 31463 14631 46315−
         0.01 = 0.00507 53412 17270 24365 60507 53412 17270 24365 60510−
        0.001 = 0.00040 61115 64570 65176 76355 44264 16254 02030 44672+
       0.0001 = 0.00003 21556 13530 70414 54512 75170 33021 15002 35223−
      0.00001 = 0.00000 24761 32610 70664 36041 06077 17401 56063 384417−
     0.000001 = 0.00000 02061 57364 05586 66151 55323 07746 44470 2603834
    0.0000001 = 0.00000 00153 27745 15274 53644 12741 72312 20354 02151+
   0.00000001 = 0.00000 00012 57143 56106 04303 473874 77341 01512 63327
  0.000000001 = 0.00000 00001 04560 27640 46655 12262 71426 40124 21742+
 0.0000000001 = 0.00000 00000 06676 83766 35367 55653 37265 34642 01627−
           √2 = 1.32404 74681 77167 46220 42627 66115 46725 12575 17435+
           √3 = 1.56663 65641 30231 25163 54453 50265 60361 34073 42223−
           √5 = 2.17067 36334 57722 47602 57471 63008 00563 55620 32021−
          √10 = 3.12305 40726 64555 22444 02242 57101 41466 33115 22532+
           ∛2 = 1.20505 05746 15345 05342 10756 65334 25574 22415 03024+
           ∛3 = 1.34233 50444 22175 13134 67363 16133 05334 31147 60121−
          ⁴√2 = 1.14067 74050 61556 12455 72152 64430 60271 02755 73136+
         ln 2 = 0.54271 02775 75071 73632 57117 07316 30007 71366 53640+
         ln 3 = 1.06237 24752 55006 05227 32440 63065 25012 35574 5533874
        ln 10 = 2.23278 067385 52524 25405 56512 66542 56026 46050 50705+
       1/ln 2 = 1.34252 16624 58405 77027 35750 87766 40644 35175 04353+
      1/ln 10 = 0.33626 75425 11562 41614 52825 33525 27655 14756 06220−
            π = 3.11037 55242 10264 30215 14230 63050 56006 70163 21122+
   1° = π/180 = 0.01073 72152 11224 72344 25603 54276 63351 22056 11544+
          1/π = 0.24276 30155 62344 20251 238760 47257 50765 15156 70067−
           π² = 11.67517 14467 62135 713822 25561 15466 30021 40654 34103−
  √π = Γ(1/2) = 1.61837 61106 64736 65247 47035 40510 15273 34470 17762−
       Γ(1/3) = 2.58847 385234 51013 61316 73106 47644 54653 00106 66046−
       Γ(2/3) = 1.26523 57112 14154 74812 54572 87655 60126 23231 02452+
            e = 2.55760 52130 50535 51246 52773 42542 00471 72363 61661+
          1/e = 0.27426 53066 13167 46761 52726 75486 02440 52371 03355+
           e² = 7.30714 45615 23355 33460 63507 35040 32664 25356 50217+
            γ = 0.44742 14770 67666 06172 23215 74376 01002 51313 25521
         ln π = 1.11206 40443 47503 86413 65374 52661 52410 87511 46057+
            φ = 1.47433 57156 27751 23701 27634 71401 40271 66710 15010+
          e^γ = 1.61772 13452 61152 65761 22477 386553 59927 17554 21260+
      e^(π/4) = 2.14215 31512 16162 52370 35530 11342 53525 44307 02171−
        sin 1 = 0.65665 24436 04414 73402 038067 23644 11612 07474 14505−
        cos 1 = 0.42450 50037 32406 42711 07022 14666 27820 70675 12321+
       −ζ′(2) = 0.74001 45144 58253 42362 42107 23350 50074 46100 27706+
         ζ(3) = 1.147385 00023 60014 20470 15613 42561 31715 10177 06614+
         ln φ = 0.36630 26256 61213 01145 13700 41004 52264 30700 40646+
       1/ln φ = 2.04776 60111 17144 41512 11486 16575 00355 43630 40651+
     −ln ln 2 = 0.27351 71233 67265 63650 17401 56637 26334 31455 57005−

Several of the 40-digit values in Table 1 were computed on a desk calculator
by John W. Wrench, Jr., for the first edition of this book. When computer
software for such calculations became available during the 1970s, all of his
contributions proved to be correct. The 40-digit values of other fundamental
constants can be found in Eqs. 4.5.2-(60), 4.5.3-(26), 4.5.3-(41), 4.5.4-(9), and
the answers to exercises 4.5.4-8, 4.5.4-25, 4.6.4-58.


Table 3 


VALUES OF HARMONIC NUMBERS, BERNOULLI NUMBERS, 
AND FIBONACCI NUMBERS, FOR SMALL VALUES OF n 


  n   H_n                               B_n                     F_n
  0   0                                 1                       0
  1   1                                 −1/2                    1
  2   3/2                               1/6                     1
  3   11/6                              0                       2
  4   25/12                             −1/30                   3
  5   137/60                            0                       5
  6   49/20                             1/42                    8
  7   363/140                           0                       13
  8   761/280                           −1/30                   21
  9   7129/2520                         0                       34
 10   7381/2520                         5/66                    55
 11   83711/27720                       0                       89
 12   86021/27720                       −691/2730               144
 13   1145993/360360                    0                       233
 14   1171733/360360                    7/6                     377
 15   1195757/360360                    0                       610
 16   2436559/720720                    −3617/510               987
 17   42142223/12252240                 0                       1597
 18   14274301/4084080                  43867/798               2584
 19   275295799/77597520                0                       4181
 20   55835135/15519504                 −174611/330             6765
 21   18858053/5173168                  0                       10946
 22   19093197/5173168                  854513/138              17711
 23   444316699/118982864               0                       28657
 24   1347822955/356948592              −236364091/2730         46368
 25   34052522467/8923714800            0                       75025
 26   34395742267/8923714800            8553103/6               121393
 27   312536252003/80313433200          0                       196418
 28   315404588903/80313433200          −23749461029/870        317811
 29   9227046511387/2329089562800       0                       514229
 30   9304682830147/2329089562800       8615841276005/14322     832040

For any x, let $H_x = \sum_{n\ge1}\Bigl(\frac1n - \frac1{n+x}\Bigr)$. Then

  $H_{1/2} = 2 - 2\ln 2$,
  $H_{1/3} = 3 - \frac12\pi/\sqrt3 - \frac32\ln 3$,
  $H_{2/3} = \frac32 + \frac12\pi/\sqrt3 - \frac32\ln 3$,
  $H_{1/4} = 4 - \frac12\pi - 3\ln 2$,
  $H_{3/4} = \frac43 + \frac12\pi - 3\ln 2$,
  $H_{1/5} = 5 - \frac12\pi\phi^{3/2}5^{-1/4} - \frac54\ln 5 - \frac12\sqrt5\ln\phi$,
  $H_{2/5} = \frac52 - \frac12\pi\phi^{-3/2}5^{-1/4} - \frac54\ln 5 + \frac12\sqrt5\ln\phi$,
  $H_{3/5} = \frac53 + \frac12\pi\phi^{-3/2}5^{-1/4} - \frac54\ln 5 + \frac12\sqrt5\ln\phi$,
  $H_{4/5} = \frac54 + \frac12\pi\phi^{3/2}5^{-1/4} - \frac54\ln 5 - \frac12\sqrt5\ln\phi$,
  $H_{1/6} = 6 - \frac12\pi\sqrt3 - 2\ln 2 - \frac32\ln 3$,
  $H_{5/6} = \frac65 + \frac12\pi\sqrt3 - 2\ln 2 - \frac32\ln 3$,

and, in general, when 0 < p < q (see exercise 1.2.9-19),

$$H_{p/q} = \frac qp - \frac\pi2\cot\frac pq\pi - \ln 2q + 2\sum_{1\le n<q/2}\cos\frac{2pn}{q}\pi\cdot\ln\sin\frac nq\pi.$$

INDEX TO NOTATIONS


In the following formulas, letters that are not further qualified have the following 
significance: 


j,k integer-valued arithmetic expression 
m,n nonnegative integer-valued arithmetic expression 
x,y real-valued arithmetic expression 
z complex-valued arithmetic expression 
f real-valued or complex-valued function 
S, T set or multiset 
Where 
Formal symbolism Meaning defined 
∎ | end of algorithm, program, or proof | 1.1
A_n or A[n] | the nth element of linear array A | 1.1
A_{mn} or A[m,n] | the element in row m and column n of rectangular array A | 1.1
V ← E | give variable V the value of expression E | 1.1
U ↔ V | interchange the values of variables U and V | 1.1
(R ? a : b) | conditional expression: denotes a if relation R is true, b if R is false |
[R] | characteristic function of relation R: (R ? 1 : 0) | 1.2.3
δ_{kj} | Kronecker delta: [j = k] | 1.2.3
[z^n] g(z) | coefficient of z^n in power series g(z) | 1.2.9
Σ_{R(k)} f(k) | sum of all f(k) such that the variable k is an integer and relation R(k) is true | 1.2.3
Π_{R(k)} f(k) | product of all f(k) such that the variable k is an integer and relation R(k) is true | 1.2.3
min_{R(k)} f(k) | minimum value of all f(k) such that the variable k is an integer and relation R(k) is true | 1.2.3
max_{R(k)} f(k) | maximum value of all f(k) such that the variable k is an integer and relation R(k) is true | 1.2.3
ℜz | real part of z | 1.2.2
ℑz | imaginary part of z | 1.2.2
z̄ | complex conjugate: ℜz − i ℑz | 1.2.2
A^T | transpose of rectangular array A: A^T[j,k] = A[k,j] |
x^y | x to the y power (when x is positive) | 1.2.2
x^k | x to the kth power: (k ≥ 0 ? Π_{0≤j<k} x : 1/x^{−k}) | 1.2.2
x^{\overline{k}} | x to the k rising: Γ(x+k)/Γ(x) = (k ≥ 0 ? Π_{0≤j<k} (x+j) : 1/(x+k)^{\overline{−k}}) | 1.2.5
x^{\underline{k}} | x to the k falling: x!/(x−k)! = (k ≥ 0 ? Π_{0≤j<k} (x−j) : 1/(x−k)^{\underline{−k}}) | 1.2.5
n! | n factorial: Γ(n+1) = n^{\underline{n}} | 1.2.5
f′(x) | derivative of f at x | 1.2.9
f″(x) | second derivative of f at x | 1.2.10
f^{(n)}(x) | nth derivative: (n = 0 ? f(x) : g′(x)), where g(x) = f^{(n−1)}(x) | 1.2.11.2
f^{[n]}(x) | nth iterate: (n = 0 ? x : f(f^{[n−1]}(x))) | 4.7
f^{{n}}(x) | nth induced function: f^{{n}}(x) = f(x f^{{n}}(x)^n) | 4.7
H_n^{(x)} | harmonic number of order x: Σ_{1≤k≤n} 1/k^x | 1.2.7
H_n | harmonic number: H_n^{(1)} | 1.2.7
F_n | Fibonacci number: (n ≤ 1 ? n : F_{n−1} + F_{n−2}) | 1.2.8
B_n | Bernoulli number: n! [z^n] z/(e^z − 1) | 1.2.11.2
X·Y | dot product of vectors X = (x_1, ..., x_n) and Y = (y_1, ..., y_n): x_1y_1 + ··· + x_ny_n | 3.3.4
j\k | j divides k: k mod j = 0 and j > 0 | 1.2.4
S \ T | set difference: {a | a in S and a not in T} |
⊕ ⊖ ⊗ ⊘ | rounded or special operations | 4.2.1
(... a_1 a_0 . a_{−1} ...)_b | radix-b positional notation: Σ_k a_k b^k | 4.1
//x_1, x_2, ..., x_n// | continued fraction: 1/(x_1 + 1/(x_2 + 1/(··· + 1/(x_n) ···))) | 4.5.3
$\binom{x}{k}$ | binomial coefficient: (k < 0 ? 0 : x^{\underline{k}}/k!) | 1.2.6
$\binom{n}{n_1, n_2, \ldots, n_m}$ | multinomial coefficient (defined only when n = n_1 + n_2 + ··· + n_m) | 1.2.6
$\left[{n\atop m}\right]$ | Stirling number of the first kind: Σ_{0<k_1<k_2<···<k_{n−m}<n} k_1 k_2 ... k_{n−m} | 1.2.6
$\left\{{n\atop m}\right\}$ | Stirling number of the second kind: Σ_{1≤k_1≤k_2≤···≤k_{n−m}≤m} k_1 k_2 ... k_{n−m} | 1.2.6
{a | R(a)} | set of all a such that the relation R(a) is true |
{a_1, ..., a_n} | the set or multiset {a_k | 1 ≤ k ≤ n} |
{x} | fractional part (used in contexts where a real value, not a set, is implied): x − ⌊x⌋ | 1.2.11.2
[a..b] | closed interval: {x | a ≤ x ≤ b} | 1.2.2
(a..b) | open interval: {x | a < x < b} | 1.2.2
[a..b) | half-open interval: {x | a ≤ x < b} | 1.2.2
(a..b] | half-closed interval: {x | a < x ≤ b} | 1.2.2
|S| | cardinality: the number of elements in set S |
|x| | absolute value of x: (x ≥ 0 ? x : −x) |
|z| | absolute value of z: √(z z̄) | 1.2.2
⌊x⌋ | floor of x, greatest integer function: max_{k≤x} k | 1.2.4
⌈x⌉ | ceiling of x, least integer function: min_{k≥x} k | 1.2.4
((x)) | sawtooth function | 3.3.3
⟨X_n⟩ | the infinite sequence X_0, X_1, X_2, ... (here the letter n is part of the symbolism) | 1.2.9
γ | Euler's constant: lim_{n→∞} (H_n − ln n) | 1.2.7
γ(x, y) | incomplete gamma function: ∫_0^y e^{−t} t^{x−1} dt | 1.2.11.3
Γ(x) | gamma function: (x − 1)! = γ(x, ∞) | 1.2.5
δ(x) | characteristic function of the integers | 3.3.3
e | base of natural logarithms: Σ_{n≥0} 1/n! | 1.2.2
ζ(x) | zeta function: lim_{n→∞} H_n^{(x)} (when x > 1) | 1.2.7
K_n(x_1, ..., x_n) | continuant polynomial | 4.5.3
ℓ(u) | leading coefficient of polynomial u | 4.6
l(n) | length of shortest addition chain for n | 4.6.3
Λ(n) | von Mangoldt's function | 4.5.3
μ(n) | Möbius function | 4.5.2
ν(n) | sideways sum | 4.6.3
O(f(n)) | big-oh of f(n), as the variable n → ∞ | 1.2.11.1
O(f(z)) | big-oh of f(z), as the variable z → 0 | 1.2.11.1
Ω(f(n)) | big-omega of f(n), as the variable n → ∞ | 1.2.11.1
Θ(f(n)) | big-theta of f(n), as the variable n → ∞ | 1.2.11.1
π(x) | prime count: Σ_{n≤x} [n is prime] | 4.5.4
π | circle ratio: 4 Σ_{n≥0} (−1)^n/(2n + 1) | 4.3.1
φ | golden ratio: ½(1 + √5) | 1.2.8
∅ | empty set: {x | 0 = 1} |
φ(n) | Euler's totient function: Σ_{0≤k<n} [k ⊥ n] | 1.2.4
∞ | infinity: larger than any number | 4.2.2
det(A) | determinant of square matrix A | 1.2.3
sign(x) | sign of x: (x = 0 ? 0 : x/|x|) |
deg(u) | degree of polynomial u | 4.6
cont(u) | content of polynomial u | 4.6.1
pp(u(x)) | primitive part of polynomial u | 4.6.1
log_b x | logarithm, base b, of x (when x > 0, b > 0, and b ≠ 1): the y such that x = b^y | 1.2.2
ln x | natural logarithm: log_e x | 1.2.2
lg x | binary logarithm: log_2 x | 1.2.2
exp x | exponential of x: e^x | 1.2.9
j ⊥ k | j is relatively prime to k: gcd(j, k) = 1 | 1.2.4
gcd(j, k) | greatest common divisor of j and k: (j = k = 0 ? 0 : max_{d\j, d\k} d) | 4.5.2
lcm(j, k) | least common multiple of j and k: (jk = 0 ? 0 : min_{d>0, j\d, k\d} d) | 4.5.2
x mod y | mod function: (y = 0 ? x : x − y⌊x/y⌋) | 1.2.4
u(x) mod v(x) | remainder of polynomial u after division by polynomial v | 4.6.1
x ≡ x′ (modulo y) | relation of congruence: x mod y = x′ mod y | 1.2.4
x ≈ y | x is approximately equal to y | 3.5, 4.2.2
Pr(S(n)) | probability that statement S(n) is true, for random positive integers n | 3.5
Pr(S(X)) | probability that statement S(X) is true, for random values of X | 1.2.10
E X | expected value of X: Σ_x x Pr(X = x) | 1.2.10
mean(g) | mean value of the probability distribution represented by generating function g: g′(1) | 1.2.10
var(g) | variance of the probability distribution represented by generating function g: g″(1) + g′(1) − g′(1)² | 1.2.10
(min x_1, ave x_2, max x_3, dev x_4) | a random variable having minimum value x_1, average (expected) value x_2, maximum value x_3, standard deviation x_4 | 1.2.10
␣ | one blank space | 1.3.1
rA | register A (accumulator) of MIX | 1.3.1
rX | register X (extension) of MIX | 1.3.1
rI1, ..., rI6 | (index) registers I1, ..., I6 of MIX | 1.3.1
rJ | (jump) register J of MIX | 1.3.1
(L:R) | partial field of MIX word, 0 ≤ L ≤ R ≤ 5 | 1.3.1
OP ADDRESS,I(F) | notation for MIX instruction | 1.3.1, 1.3.2
u | unit of time in MIX | 1.3.1
* | “self” in MIXAL | 1.3.2
0F, 1F, 2F, ..., 9F | “forward” local symbol in MIXAL | 1.3.2
0B, 1B, 2B, ..., 9B | “backward” local symbol in MIXAL | 1.3.2
0H, 1H, 2H, ..., 9H | “here” local symbol in MIXAL | 1.3.2

INDEX TO ALGORITHMS AND THEOREMS


Algorithm 3.1K, 5. 
Theorem 3.2.1.2A, 17-19. 
Theorem 3.2.1.2B, 20. 
Theorem 3.2.1.2C, 20-21. 
Theorem 3.2.1.2D, 21. 
Lemma 3.2.1.2P, 17-18. 
Lemma 3.2.1.2Q, 18. 
Lemma 3.2.1.2R, 19. 
Algorithm 3.2.2A, 28. 
Program 3.2.2A, 28. 
Algorithm 3.2.2B, 34. 
Program 3.2.2B, 34. 
Algorithm 3.2.2M, 33. 
Algorithm 3.2.2X, 557. 
Algorithm 3.2.2Y, 557. 
Algorithm 3.3.2C, 64-65. 
Algorithm 3.3.2G, 62. 
Algorithm 3.3.2P, 65-66. 
Algorithm 3.3.2R, 563. 
Algorithm 3.3.2S, 71.
Lemma 3.3.3B, 84-85. 
Algorithm 3.3.3D, 573. 
Theorem 3.3.3D, 87. 
Theorem 3.3.3K, 89. 
Theorem 3.3.3P, 80-81. 
Lemma 3.3.4A, 99. 
Theorem 3.3.4N, 113. 
Algorithm 3.3.4S, 101-103.
Algorithm 3.3.4S′, 582.
Algorithm 3.4.1A, 134. 
Algorithm 3.4.1B, 588-589. 
Algorithm 3.4.1F, 129. 
Algorithm 3.4.1G, 587. 
Algorithm 3.4.1L, 126. 


Algorithm 3.4.1M, 127-128. 


Algorithm 3.4.1N, 587. 
Algorithm 3.4.1P, 122. 
Algorithm 3.4.1R, 130-131. 


Algorithm 3.4.1S, 133. 
Algorithm 3.4.2P, 145. 
Algorithm 3.4.2R, 144. 
Algorithm 3.4.2S, 142. 
Definition 3.5A, 150. 
Theorem 3.5A, 152-153. 
Definition 3.5B, 151. 
Theorem 3.5B, 153-154. 
Definition 3.5C, 151. 
Theorem 3.5C, 155-158. 
Definition 3.5D, 151. 
Definition 3.5E, 155. 
Lemma 3.5E, 156. 
Theorem 3.5F, 158. 
Theorem 3.5G, 174. 
Algorithm 3.5L, 173. 


Theorem 3.5M, 166-167. 


Corollary 3.5P, 154. 
Definition 3.5P, 171. 
Theorem 3.5P, 175. 
Lemma 3.5P1, 171-172. 
Lemma 3.5P2, 172. 
Lemma 3.5P3, 172. 
Lemma 3.5P4, 172. 
Definition 3.5Q1, 168. 
Definition 3.5Q2, 168. 
Definition 3.5R1, 159. 
Definition 3.5R2, 159. 
Definition 3.5R3, 161. 
Definition 3.5R4, 161. 
Definition 3.5R5, 162. 
Definition 3.5R6, 163. 
Corollary 3.5S, 154. 
Lemma 3.5T, 163. 
Algorithm 3.5W, 164. 


Theorem 3.5W, 164-165. 


Algorithm 4.1H, 610. 
Algorithm 4.1S, 609. 
Algorithm 4.2.1A, 216-217. 
Program 4.2.1A, 218-219. 
Algorithm 4.2.1M, 220. 
Program 4.2.1M, 220-221. 
Algorithm 4.2.1N, 217. 
Theorem 4.2.2A, 235. 
Theorem 4.2.2B, 236. 
Theorem 4.2.2C, 236. 
Theorem 4.2.2D, 237. 
Lemma 4.2.2T, 235. 
Program 4.2.3A, 247-249. 
Program 4.2.3D, 251-252. 
Program 4.2.3M, 249-250. 
Theorem 4.2.4F, 260-262. 
Lemma 4.2.4Q, 258-259. 
Algorithm 4.3.1A, 266. 
Program 4.3.1A, 266-267. 
Theorem 4.3.1A, 271. 
Algorithm 4.3.1A’, 623. 
Program 4.3.1A’, 623. 
Algorithm 4.3.1B, 623. 
Program 4.3.1B, 624. 
Theorem 4.3.1B, 272. 
Algorithm 4.3.1C, 623-624. 
Algorithm 4.3.1D, 272-273. 


Program 4.3.1D, 273-275, 626. 


Algorithm 4.3.1M, 268. 
Program 4.3.1M, 269-270. 
Algorithm 4.3.1N, 282. 
Algorithm 4.3.1Q, 625. 
Algorithm 4.3.1S, 267. 
Program 4.3.1S, 267-268. 
Theorem 4.3.2C, 286. 
Theorem 4.3.2S, 291. 
Theorem 4.3.3A, 296. 
Theorem 4.3.3B, 302. 
Algorithm 4.3.3R, 312. 
Algorithm 4.3.3T, 299-301. 
Algorithm 4.4A, 636. 
Algorithm 4.5.2A, 337. 
Program 4.5.2A, 337. 
Program 4.5.2A’, 373. 
Algorithm 4.5.2B, 338. 
Program 4.5.2B, 339-340. 
Algorithm 4.5.2C, 341. 
Theorem 4.5.2D, 342. 
Algorithm 4.5.2E, 336. 
Algorithm 4.5.2K, 356. 
Algorithm 4.5.2L, 347. 
Algorithm 4.5.2X, 342. 
Algorithm 4.5.2Y, 646. 
Theorem 4.5.3E, 368. 
Theorem 4.5.3F, 360. 
Algorithm 4.5.3L, 375. 
Corollary 4.5.3L, 360. 
Lemma 4.5.3M, 367. 
Theorem 4.5.3W, 366. 
Algorithm 4.5.4A, 380. 
Theorem 4.5.4A, 396. 


Algorithm 4.5.4B, 385-386. 


Algorithm 4.5.4C, 387. 
Algorithm 4.5.4D, 389. 
Program 4.5.4D, 390. 
Theorem 4.5.4D, 402. 


Algorithm 4.5.4E, 397-398. 
Algorithm 4.5.4F, 659-660. 
Algorithm 4.5.4L, 667-668. 


Theorem 4.5.4L, 409-411. 
Algorithm 4.5.4P, 395. 
Algorithm 4.5.4S, 658. 


Algorithm 4.6.1C, 428-429. 


Algorithm 4.6.1D, 421. 


Algorithm 4.6.1E, 426-427. 


Lemma 4.6.1G, 422-423. 
Lemma 4.6.1H, 423. 


Algorithm 4.6.1R, 425-426. 


Algorithm 4.6.1S, 676. 
Algorithm 4.6.1T, 677. 


Algorithm 4.6.2B, 441-442. 
Algorithm 4.6.2D, 447-448. 


Algorithm 4.6.2F, 452. 
Algorithm 4.6.2N, 444. 
Algorithm 4.6.2S, 681-682. 
Algorithm 4.6.3A, 462. 
Corollary 4.6.3A, 468. 
Program 4.6.3A, 691. 
Theorem 4.6.3A, 467-468. 
Theorem 4.6.3B, 468-469. 
Theorem 4.6.3C, 469-470. 
Theorem 4.6.3D, 470-471. 


Corollary 4.6.3E, 472-473. 
Theorem 4.6.3E, 471-472. 
Theorem 4.6.3F, 476. 
Theorem 4.6.3G, 479. 
Theorem 4.6.3H, 475-476. 
Lemma 4.6.3K, 474. 
Lemma 4.6.3P, 471. 
Algorithm 4.6.3T, 692. 
Theorem 4.6.4A, 496. 
Algorithm 4.6.4C, 713. 
Theorem 4.6.4C, 496. 
Algorithm 4.6.4D, 489. 
Theorem 4.6.4E, 493-494. 
Algorithm 4.6.4G, 698. 
Algorithm 4.6.4H, 489. 
Algorithm 4.6.4J, 698. 
Theorem 4.6.4M, 495. 
Algorithm 4.6.4N, 713-714. 
Algorithm 4.6.4S, 489. 
Lemma 4.6.4T, 508. 
Theorem 4.6.4W, 513-514. 
Algorithm 4.7G, 720. 
Algorithm 4.7L, 527-528. 
Algorithm 4.7N, 529. 
Algorithm 4.7T, 528. 


At any step, arbitrary combinations of algorithms and theorems 
can be applied to solve a given problem. 


— KARSTEN HOMANN and JACQUES CALMET (1995) 


INDEX AND GLOSSARY 


Seek and ye shall find. 
— Matthew 7:7 


When an index entry refers to a page containing a relevant exercise, see also the answer to 
that exercise for further information. An answer page is not indexed here unless it refers to a 
topic not included in the statement of the exercise. 


0-origin indexing, 444, 512. 

0-1 matrices, 499. 

0-1 polynomials, 497, 519, 707. 

[0..1) sequence, 151. 

2-adic numbers, 213, 629. 

10-adic numbers, 632. 

1009, vi, 188, 413, 661. 

69069, 75, 106, 108. 

∞, representation of, 225, 244-245, 332. 

∞-distributed sequence, 151-161, 
177, 180-182. 


γ (Euler’s constant), 359, 379, 726-727, 733. 


π (circle ratio), 41, 151, 158, 161, 198, 200, 
209, 279-280, 284, 358, 726-727, 733. 
as “random” example, 21, 25, 33, 47, 52, 
89, 103, 106, 108, 184, 238, 243, 252, 
324-325, 555, 593, 599, 665. 
π(x) (prime count), 381-382, 416. 
ρ(n) (ruler function), 540. 
φ (golden ratio), 164, 283, 359, 360, 514, 
652, 726-727, 733. 
logarithm of, 283. 
number system, 209. 
φ(n) (totient function), 19-20, 289, 
369, 376, 583, 646. 
χ², 42, 56, see Chi-square. 


A priori tests, 80. 
Abacus, 196. 
binary, 200. 

Abel, Niels Henrik, binomial theorem, 
58, 535. 

Abramowitz, Milton, 44. 

Absolute error, 240, 309, 312-313. 

Absorption laws, 694. 

Abuse of probability, 433. 

Abuse of theory, 88. 

ACC: Floating point accumulator, 
218-219, 248-249. 

Acceptance-rejection method, 125-126, 
128-129, 134, 138, 139, 591. 

Accuracy of floating point arithmetic, 222, 
229-245, 253, 329, 438, 485. 

Accuracy of random number generation, 
27, 95, 105, 185. 


Adaptation of coefficients, 490-494, 516-517. 


Add-with-carry sequence, 23, 35, 72, 
108, 547. 


Addition, 194, 207, 210, 213, 265-267. 
complex, 487. 
continued fractions, 649. 
double-precision, 247-249, 251. 
floating point, 215-220, 227-228, 230-231, 
235-238, 253-254, 602. 
fractions, 330-331. 
left to right, 281. 
mixed-radix, 281. 
mod m, 12, 15, 203, 287-288. 
modular, 285-286, 293. 
multiprecision, 266-267, 276-278, 
281, 283. 
polynomial, 418-420. 
power series, 525. 
sideways, 463, 466. 
Addition chains, 465-485, 494, 519. 
ascending, 467. 
dual, 481, 485. 
l⁰-, 479, 483, 485. 
star, 467, 473-477, 480, 482. 
Addition-subtraction chains, 484. 
Additive random number generation, 27-29, 
39-40, 186-188, 193. 
Adleman, Leonard Max, 403, 405, 
414, 417, 671. 
Admissible numbers, 177. 
Agrawal, Manindra, 396. 
Ahrens, Joachim Heinrich Lüdecke, 
119, 129-130, 132-134, 136, 137, 
140, 141, 588. 
Ahrens, Wilhelm Ernst Martin Georg, 208. 
Akushsky, Izrail Yakovlevich (Акушский, 
Израиль Яковлевич), 292. 
al-Biruni, Abu al-Rayhan Muhammad 
ibn Ahmad (gla! gi 
al-Kashi, Jamshid ibn Mas‘ūd, 198, 326, 462. 
al-Khwarizmi, Abu ‘Abd Allah 
Muhammad ibn Musa, 197, 280. 
al-Samaw’al (= as-Samaw’al), 
ibn Yahya ibn Yahūda al-Maghribi, 198. 
al-Uqlidisi, Abu al-Hasan 
Ahmad ibn Ibrahim, 
198, 280-281, 461. 
Ala-Nissila, Tapio, 75, 570. 
Alanen, Jack David, 30. 

Aldous, David John, 145. 

Alekseyev, Valery Borisovich (Алексеев, 
Валерий Борисович), 699. 

Alexeev, Boris Vasilievich (Алексеев, 
Борис Васильевич), 117. 

Alexi, Werner, 669. 

Alford, William Robert, 659. 

Algebra, free associative, 437. 

Algebraic dependence, 496, 518. 

Algebraic functions, 533. 

Algebraic integers, 396. 

Algebraic number fields, 331, 333, 
345, 403, 674. 

Algebraic system: A set of elements 
together with operations defined 
on them, see Field, Ring, Unique 
factorization domain. 

ALGOL language, 279. 

Algorithms: Precise rules for transforming 
specified inputs into specified outputs 
in a finite number of steps. 

analysis of, 7-9, 76, 140, 147, 276-278, 
281, 301-302, 348-356, 360-373, 
377-378, 382-384, 399-400, 435, 
445-447, 455-456, 530-532, 658, 714. 

complexity of, 138, 178-179, 280, 294-318, 
396, 401-402, 416, 453, 465-485, 
494-498, 516-524, 720. 

discovery of, 99. 

historical development of, 335, 461-462. 

proof of, 281-282, 336-337, 592. 

Alias method, 120, 127, 139. 

Allouche, Jean-Paul, 656. 

ALPAK system, 419. 

Alt, Helmut, 706. 

American National Standards Institute, 
226, 246, 600, 602. 

AMM: American Mathematical Monthly, 
published by the Mathematical 
Association of America since 1894. 

Amplification of guesses, 172-174, 416-417. 

Analysis of algorithms, 7-9, 76, 140, 147, 
276-278, 281, 301-302, 348-356, 
360-373, 377-378, 382-384, 399-400, 
435, 445-447, 455-456, 530-532, 
658, 714. 

history, 360. 

Analytical Engine, 189, 201. 

Ananthanarayanan, Kasi, 128. 

AND (bitwise and), 140, 188, 322, 328-329, 
389-390, 453, 671. 

Anderson, Stanley Frederick, 312. 

ANSI: The American National Standards 
Institute, 226, 246, 600, 602. 

Antanairesis, 335-336, 378. 

Apollonius of Perga (Ἀπολλώνιος 
ὁ Περγαῖος), 225. 

Apparently random numbers, 3-4, 170-171. 

Apparition, rank of, 410-411. 


Approximate associative law, 232-233, 
239-240, 244. 

Approximate equality, 224, 233-235, 
239, 242-243, 245. 

Approximately linear density, 126. 

Approximation, by rational functions, 
438-439, 534. 

by rational numbers, 331-332, 
378-379, 617. 

Arabic mathematics, 197, 280-281, 
326, 461-462. 

Arazi, Benjamin, 396. 

Arbitrary precision, 279, 283, 331, 416, 
see also Multiple-precision. 

Arbogast, Louis François Antoine, 722. 

Archibald, Raymond Clare, 201. 

Arctangent, 313, 628. 

Aristotle of Stagira, son of Nicomachus 
(Ἀριστοτέλης Νικομάχου ὁ Σταγιρίτης), 
335. 

Arithmetic, 194-537, see Addition, 
Comparison, Division, Doubling, 
Exponentiation, Greatest common 
divisor, Halving, Multiplication, 
Reciprocals, Square root, Subtraction. 

complex, 205, 228, 283, 292, 307-310, 
487, 501, 506, 519, 700, 706. 

floating point, 214-264. 

fractions, 330-333, 420, 526. 

fundamental theorem of, 334, 422, 483. 

mod m, 12-16, 185-186, 203, 284, 
287-288. 

modular, 284-294, 302-305, 450, 454, 499. 

multiprecision, 265-318. 

polynomial, 418-524. 

power series, 525-537. 

rational, 330-333, 420, 526. 

Arithmetic chains, see Quolynomial chains. 

Armengaud, Joël, 409. 

Arney, James W., 385. 

Arrival time, 132. 

Arwin, Axel, 687. 

Aryabhata I, 343. 

ASCII: The American Standard Code for 
Information Interchange, 417. 

Ashenhurst, Robert Lovett, 240, 242, 327. 

Associative law, 229-233, 242, 341, 418, 694. 

approximate, 232-233, 239-240, 244. 

Asymptotic values: Functions that express 
the limiting behavior approached 
by numerical quantities, 59-60, 79, 
263-264, 355, 372-373, 377-378, 415, 
472, 525, 541-542, 659, 686, 722. 

Atanasoff, John Vincent, 202. 

Atkin, Arthur Oliver Lonsdale, 681. 

Atrubin, Allan Joseph, 315. 

Automata (plural of Automaton), 
313-317, 329, 416. 

Automorphic numbers, 293-294. 
Avogadro di Quaregna e Cerreto, Lorenzo 
Romano Amedeo Carlo, number, 
214, 227, 238, 240. 

Axioms for floating point arithmetic, 
230-231, 242-245. 

Avanzi, Roberto Maria, 396. 


b-ary number, 151. 
b-ary sequence, 151-153, 177. 
Babbage, Charles, 201. 
Babenko, Konstantin Ivanovich (Бабенко, 
Константин Иванович), 366, 376. 
Babington-Smith, Bernard, 3, 74, 76. 
Babylonian mathematics, 196, 225, 335. 
Bach, Carl Eric, 395, 661, 663, 689. 
Bachet, Claude Gaspard, sieur de 
Méziriac, 208. 
Bag, 694. 
Bailey, David Harold, 284, 634. 
Baker, Kirby Alan, 316. 
Balanced binary number system, 213. 
Balanced decimal number system, 211. 
Balanced mixed-radix number system, 
103, 293, 631. 
Balanced ternary number system, 207-208, 
209, 227, 283, 353. 
Ballantine, John Perry, 278. 
Bareiss, Erwin Hans, 262, 292, 434. 
Barlow, Jesse Louis, 262. 
Barnard, Robert, 292. 
Barnsley, Michael Fielding, 206. 
Barton, David Elliott, 74, 566. 
Barycentric coordinates, 567. 
Base of representation, 195. 
floating point, 214-215, 254, 263. 
Baseball, 378. 
Bauer, Friedrich Ludwig, 241-242, 327. 
Baum, Ulrich, 701. 
Baur, Walter, 718. 
Bays, John Carter, 34. 
Beauzamy, Bernard, 452, 461, 683, 684. 
Beckenbach, Edwin Ford, 135. 
Becker, Oskar Joachim, 359. 
Béjian, Robert, 164. 
Belaga, Edward Grigorievich (Белага, 
Эдуард Григорьевич), 496. 
Bell, Eric Temple, polynomials, 722. 
Bell Telephone Laboratories Model V, 225. 
Bellman, Richard Ernest, ix. 
Ben-Or, Michael, 669. 
Bender, Edward Anton, 385. 
Benford, Frank, 255. 
Bentley, Jon Louis, 141. 
Berger, Arno, 262. 
Bergman, George Mark, 676. 
Berkowitz, Stuart J., 718. 
Berlekamp, Elwyn Ralph, 439, 449, 456, 681. 
algorithm, 439-447. 
Bernoulli, Jacques (= Jakob = James), 200. 
numbers B_n, 355, 569. 
numbers, table, 728. 
sequences, 177. 
Bernoulli, Nicolas (= Nikolaus), 449. 
Bernstein, Daniel Julius, 396, 697, 724. 
Berrizbeitia Aristeguieta, Pedro José de 
la Santisima Trinidad, 396. 
Besicovitch, Abram Samoilovitch 
(Безикович, Абрам Самойлович), 178. 
Beta distribution, 134-135. 

Beyer, William Aaron, 115. 

Bharati Krishna Tirthaji Maharaja, 
Jagadguru Swami Sri, 208. 

Bhaskara I, Acarya, 343. 

Bienaymé, Irénée Jules, 74. 

Bilinear forms, 506-514, 520-524. 

Billingsley, Patrick Paul, 384, 661. 

Bin-packing problem, 585. 

Binary abacus, 200. 

Binary basis, 212. 

Binary-coded decimal, 202, 322, 328-329. 

Binary computer: A computer that 
manipulates numbers primarily in the 
binary (radix 2) number system, 30-32, 
201-202, 276, 328, 339, 389-390. 

Binary-decimal conversion, 319-329. 

Binary digit, 195, 200. 

Binary gcd algorithms, 338-341, 
348-356, 435. 
compared to Euclid’s, 341. 
extended, 356. 

Binary method for exponentiation, 
461-463, 466, 482, 696. 

Binary number systems, 195, 198-206, 
209-213, 419, 461, 483. 

Binary point, 195. 

Binary recurrences, 318, 466, 634, 692, 714. 

Binary search, 324. 

Binary search trees, 593. 

Binary shift, 322, 339, 481, 637, 686. 

Binary trees, 378, 527, 696, 723. 

BINEG computer, 205. 

Binet, Jacques Philippe Marie, 653. 

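identity: $\bigl(\sum_{j=1}^{n} a_j x_j\bigr)\bigl(\sum_{k=1}^{n} b_k y_k\bigr) = 
\bigl(\sum_{j=1}^{n} a_j y_j\bigr)\bigl(\sum_{k=1}^{n} b_k x_k\bigr) + 
\sum_{1 \le j < k \le n} (a_j b_k - a_k b_j)(x_j y_k - x_k y_j)$, 
564. 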
Bini, Dario Andrea, 500, 505, 515, 
714, 715, 721. 
Binomial coefficients, 416, 516, 622. 
Binomial distribution, 136-138, 141, 
401, 559. 
tail of, 167. 

Binomial number system, see Combinatorial 
number system. 

Binomial theorem, 526, 534. 

Birnbaum, Zygmunt Wilhelm, 57. 

Birthday spacings, 34, 71-72, 78-79, 188. 

BIT: Nordisk Tidskrift for Informations- 
Behandling, an international journal 
published in Scandinavia since 1961. 

Bit: “Binary digit”, either zero or 
unity, 195, 200. 

random, 12, 30-32, 35-36, 38, 48, 
119-120, 170-176. 
Bitwise operations, 30-31, 140, 202, 
328-329, 389-390, 459, 605. 
and, 140, 188, 322, 328-329, 389-390, 
453, 671. 
exclusive or, 31, 32, 193, 419. 
or, 140, 686, 695. 
shifts, 322, 339, 481, 637, 686. 
Bjork, Johan Harry, 244. 
Blachman, Nelson Merle, 205. 
Black box, 455. 
Bläser, Markus, 700. 
Bleichenbacher, Daniel, 478. 
Blinn, James Frederick, 630. 
Blöte, Hendrik Willem Jan, 29. 
Blouin, François Joseph Raymond 
Marcel, 582. 
Bluestein, Leo Isaac, 634. 
Blum, Bruce Ivan, 279. 
Blum, Fred, 433, 518. 
Blum, Lenore Carol, 36. 
Blum, Manuel, 36, 174, 179. 
integer, 174-176, 183, 416. 
Bofinger, Eve, 563. 
Bofinger, Victor John, 563. 
Bohlender, Gerd, 242, 616. 
Bojańczyk, Adam Wojciech, 646. 
Bolker, Ethan David, 593. 
Bombieri, Enrico, 683. 
norm, 458, 684. 
Boolean functions, 173-174. 
Boolean operations, see Bitwise operations. 
Boone, Steven Richard, 409. 
Booth, Andrew Donald, 608. 
Border rank, 522-523. 
Borel, Emile Félix Edouard Justin, 177. 
Borodin, Allan Bertram, 498, 505, 515, 707. 
Borosh, Itzhak, 106-107, 117, 291, 584. 
Borrow: A negative carry, 267, 273, 
281, 545. 
Borwein, Peter Benjamin, 284. 
Bosma, Wiebren, 665. 
Bouyer, Martine, 280. 
Bowden, Joseph, 201. 
Box, George Edward Pelham, 122. 
Boyar, Joan, 599. 
Boyd, David William, 691. 
Bradley, Gordon Hoover, 343, 378. 
Brakke, Kenneth Allen, 608. 
Bramhall, Janet Natalie, 530. 
Brauer, Alfred Theodor, 470, 478, 483, 690. 
Bray, Thomas Arthur, 33, 128, 544. 
Brent, Richard Peirce, 8, 28, 40, 130, 136, 
139, 141, 187, 241, 279, 280, 313, 348, 
352-353, 355, 356, 382, 386, 403, 501, 
529-534, 539-540, 556, 590, 600, 643, 
644, 646, 657, 658, 695, 719-721. 
Brezinski, Claude, 357, 721. 
Brillhart, John David, 29, 394, 396, 400, 660. 


Brockett, Roger Ware, 712. 
Brocot, Achille, 655. 
Brontë, Emily Jane, 292. 


Brooks, Frederick Phillips, Jr., 226. 
Brouwer, Luitzen Egbertus Jan, 179. 
Brown, David, see Spencer Brown. 
Brown, George William, 135. 
Brown, Mark Robbin, 712. 
Brown, Robert, see Brownian motion. 
Brown, William Stanley, 419, 428, 
438, 454, 686. 
Brownian motion, 559. 
Bruijn, Nicolaas Govert de, 181, 212, 
568, 653, 664, 686, 694. 
cycle, 38-40. 
Brute force, 642. 
Bshouty, Nader Hanna, 700. 


Buchholz, Werner, 202, 226. 

Bunch, James Raymond, 500. 

Buneman, Oscar, 706. 

Bunimovich, Leonid Abramovich 
(Бунимович, Леонид Абрамович), 262. 

Bürgisser, Peter, 515. 

Burks, Arthur Walter, 202. 

Burrus, Charles Sidney, 701. 

Butler, James Preston, 77. 

Butler, Michael Charles Richard, 442. 


C language, 185-188, 193, 327, 556. 

CACM: Communications of the ACM, 
a publication of the Association for 
Computing Machinery since 1958. 

Cahen, Eugène, 676. 

Calculating prodigies, 279, 295. 

Calmet, Jacques Francis, 736. 

Cameron, Michael James, 409. 

Camion, Paul Frédéric Roger, 449. 

Campbell, Edward Fay, Jr., vii. 

Campbell, Sullivan Graham, 226. 

Cancellation error, 58, 245. 

avoiding, 617. 

Canonical signed bit representation, 611. 

Cantor, David Geoffrey, 446, 448, 449, 
455, 460, 672, 681. 

Cantor, Georg Ferdinand Ludwig 
Philipp, 209. 

Cantor, Moritz Benedikt, 655. 

Capovani, Milvio, 500, 715. 

Caramuel de Lobkowitz, Juan, 199-200. 

Cards, playing, 2, 145, 147, 190. 

Carissan, Eugène Olivier, 390. 

Carling, Robert Laurence, 104. 

Carlitz, Leonard, 84, 90. 

Carmichael, Robert Daniel, numbers, 
659, 662. 

Carr, John Weber, III, 226, 241, 242. 

Carroll, Lewis (= Dodgson, Charles 
Lutwidge), 435. 

Carry: An amount propagated to the 
current digit position from the digits in 
less significant positions, 205, 247, 266, 
268, 273, 276-278, 281, 419, 470, 547. 

Cassels, John William Scott, 109, 158. 

Casting out nines, 289, 303, 324. 


Castle, Clive Michael Anthony, 653. 
Catalan, Eugène Charles, numbers, 723. 
Cauchy, Augustin Louis, 208. 
inequality, 97, 231. 
matrices, 331. 

CCITT: The International Telegraph 
and Telephone Consultative 
Committee of the ITU (International 
Telecommunication Union), 405. 

CDC 1604 computer, 291. 

CDC 7600 computer, 280. 

Ceiling function ⌈x⌉, 81, 732. 

Cellular automaton, see Linear iterative 
array. 

Cerlienco, Luigi, 683. 

Certificate of irreducibility, 460. 

Certificate of primality, 413. 

Cervantes Saavedra, Miguel de, 148. 

Cesàro, Ernesto, 354, 622, 640. 

Ceulen, Ludolph van, 198. 

Chace, Arnold Buffum, 462. 

Chain multiplications, 518, 519, 524. 

Chain steps, 494. 

Chains of primes, 415, 666. 

Chaitin, Gregory John, 170, 178. 

Chan, Tony Fan-Cheong (陳繁昌), 615. 

Chapple, Milton Arthur, 530. 

CHAR (convert to characters), 328. 

Characteristic, 214, see Exponent part. 

Characteristic polynomial, 499, 524. 

Charles XII of Sweden, 200. 

Chartres, Bruce Aylwin, 242. 

Chebotarev, Nikolai Grigorievich 
(Чеботарёв, Николай Григорьевич), 
690. 

Chebyshev (= Tschebyscheff), Pafnutii 
Lvovich (Чебышевъ, Пафнутій 
Львовичъ = Чебышев, Пафнутий 
Львович), inequality, 183, 669. 

Cheng, Qi, 396. 

Cheng, Russell Ch’uan Hsun, 135. 


Chesterton, Gilbert Keith, 417, 537. 
Chi-square distribution, 44, 48, 60, 
69, 135, 590. 
table, 44. 
Chi-square test, 42-47, 53-56, 58-60. 
Childers, James Gregory, 671. 
Ch’in Chiu-Shao (= Qín Jiǔsháo) 
(秦九韶), 287, 486. 
Chinese mathematics, 197-198, 287, 
340-341, 486. 
Chinese remainder algorithm, 21, 289-290, 
293, 304-305, 505. 
Chinese remainder theorem, 285-290, 
389, 404, 584. 
for polynomials, 440, 456, 509-510. 
generalized, 292. 
Chio, Felice, 435. 
Chirp transform, 634. 
Chiu Chang Suan Shu (九章算術), 340. 
Choice, random, 2, 119-121, 142. 
Chor, Benny, 669. 

Christiansen, Hanne Delgas, 74. 

Christie Mallowan, Agatha Mary Clarissa 
Miller, 725. 

Chudnovsky, David Volfovich (Чудновский, 
Давид Вольфович), 280, 311, 533. 

Chudnovsky, Gregory Volfovich 
(Чудновский, Григорий Вольфович), 
280, 311, 533. 

Church, Alonzo, 178. 

Cipolla, Michele, 682. 

Clarkson, Roland Hunter, 409. 

Classical algorithms, 265-284. 

Clausen, Michael Hermann, 515, 701. 

Clift, Neill Michael, 477-479, 485. 

Clinger, William Douglas, 638. 

CMath: Concrete Mathematics, a book 
by R. L. Graham, D. E. Knuth, 
and O. Patashnik. 

Cochran, William Gemmell, 55. 

Cocke, John, 228. 

Cocks, Clifford Christopher, 407. 

Codes, linear, 711. 

Codes for difficulty of exercises, ix-xi. 

Cody, William James, Jr., 226. 

Coefficients of a polynomial, 418. 

adaptation of, 490-494, 516-517. 
leading, 418, 451-452, 454. 
size of, 420, 451, 457-458, 461. 

Cohen, Daniel Isaac Aryeh, 622. 

Cohen, Henri José, 345, 658, 687, 712. 

Cohn, Paul Moritz, 436, 676. 

Coincidence, 6, 8. 

Collenne, Joseph Désiré, 201. 

Collins, George Edwin, 278, 279, 373, 420, 
428, 453, 454, 460, 677. 

Collision test, 70-71, 74, 158. 

Color values, 284. 

Colson, John, 208. 

Colton, Charles Caleb, vii. 

Column addition, 281, 284. 

Combination, random, 142-148. 

Combination of random number generators, 
33-36, 38, 39. 

Combinations with repetitions, 664. 

Combinatorial matrices, 116. 

Combinatorial number system, 209. 

Commutative law, 230, 333, 418, 500, 
694, 696. 

Commutative ring with identity, 418, 
420, 425. 

Comp. J.: The Computer Journal, a 
publication of the British Computer 
Society since 1958. 

Compagner, Aaldert, 29, 169. 

Companion matrix, 512. 

Comparison: Testing for <, =, or >. 

continued fractions, 654. 

floating point numbers, 233-235, 
239, 242-243. 

fractions, 332. 

mixed-radix, 290. 

modular, 290. 

multiprecision, 281. 
Complement notations for numbers, 15, 
203-204, 210, 213, 228, 275-276. 
Complete binary tree, 667. 
Completely equidistributed sequence, 177. 
Complex arithmetic, 205, 228, 283, 292, 
307-310, 487, 501, 506, 519, 700, 706. 
Complex numbers, 420, 497. 
representation of, 205-206, 209-210, 292. 
Complex radices, 205-206, 209-210. 
Complexity of calculation, 138, 178-179, 
280, 294-318, 396, 401-402, 416, 453, 
465-485, 494-498, 516-524, 720. 
Composition of power series, 533, 
535-536, 720. 
Computability, 162-163, 178. 
Concave function, 125, 139, 245, 627. 
Conditional expression, 730. 
Congruential sequence, inversive, 32-33, 40. 
Congruential sequence, linear, 10-26, 
145-146, 184-186, 193. 
choice of increment, 10-11, 17, 22, 
89, 97, 185. 
choice of modulus, 12-16, 23, 184. 
choice of multiplier, 16-26, 88-89, 
105-109, 184-185. 
choice of seed, 17, 20, 143, 184. 
period length, 16-23. 
subsequence of, 11, 73. 


Congruential sequence, quadratic, 26-27, 37. 


Conjugate of a complex number, 700, 731. 
Connection Machine, 538. 
Content of a polynomial, 423. 
Context-free grammar, 694. 
Continuant polynomials, 357, 360, 374, 377, 
379, 438, 647, 651, 655, 676. 
Continued fractions, 356-359, 396-401. 
infinite, 358-359, 374. 
quadratic irrationalities, 358, 374-375, 
397-401, 412, 415, 665. 
regular, 346, 358-359, 368, 374-379, 
412, 415, 665. 
with polynomial quotients, 438-439, 
498, 518. 
Continuous binomial distribution, 588. 
Continuous distribution functions, 49, 
53, 57, 60, 121-136. 
Continuous Poisson distribution, 588. 
Convergents, 378, 397, 438-439, 617, 622. 
Conversion of representations, 221, 228, 
252-253, 265, 288-290, 293, 304-305, 
see also Radix conversion. 
Convex function, 125, 139, 245, 684. 
Convolution, 305, 318, 525, 586. 
cyclic, 294, 305-307, 510-512, 520, 521. 
multidimensional, 710. 
negacyclic, 521. 
polynomials, see Poweroids. 
Conway, John Horton, 109, 402, 623. 
Cook, Stephen Arthur, 211, 297, 299, 
312, 318, 672, 707. 
Cooley, James William, 701. 


Coolidge, Julian Lowell, 486. 

Coonen, Jerome Toby, 226. 

Cooper, Curtis Niles, 409. 

Copeland, Arthur Herbert, 177. 

Coppersmith, Don, 182, 183, 500, 
501, 523, 671. 

Cormack, Gordon Villy, 664. 

Coroutine, 375. 

Corput, Johannes Gualtherus van der, 
163-164, 181. 

Correlation coefficient, 72-73, 77, 132. 

Cosine, 247, 490. 

Cotes, Roger, 651. 

Couffignal, Louis, 202. 

Counting law, 694. 

Coupon collector’s test, 63-65, 74, 
76, 158, 180. 
Couture, Raymond, 546, 582. 
Covariance, 67. 
matrix, 60, 69, 139. 

Cover, Thomas Merrill, 571. 

Coveyou, Robert Reginald, 26-27, 37, 
88, 92, 114, 115, 553. 

Cox, Albert George, 278. 

Crandall, Richard Eugene, 403, 632. 

Craps, 190. 

Cray T94 computer, 409. 

Cray X-MP computer, 108. 

Creative writing, 190-193. 

Crelle: Journal für die reine und angewandte 
Mathematik, an international journal 
founded by A. L. Crelle in 1826. 

Cryptography, 2, 193, 403-407, 415, 
417, 505. 

Cube root modulo m, 404, 415. 

Cunningham, Allan Joseph Champneys, 666. 

Cusick, Thomas William, 584. 

Cut-and-riffle, 147. 

Cycle in a random permutation, 384, 460. 

Cycle in a sequence, 4, 10, 22, 37-40. 

detection of, 7-8. 

Cyclic convolution, 294, 305-307, 
510-512, 520, 521. 

Cyclotomic polynomials, 394, 451, 
459, 510, 514. 


Dahl, Ole-Johan, 148, 592. 

Daniels, Henry Ellis, 568. 

Dase, Johann Martin Zacharias, 279. 

Datta, Bibhutibhusan = 
Bidyaranya, Swami, 
343, 461. 

Daudé, Hervé, 366. 

Davenport, Harold, 648. 

David, Florence Nightingale, 3, 566. 

Davis, Chandler, 606. 

Davis, Clive Selwyn, 651. 

de Bruijn, Nicolaas Govert, 181, 212, 
568, 653, 664, 686, 694. 

cycle, 38-40. 
de Finetti, Bruno, 566. 


de Groote, Hans Friedrich, 708. 

de Jong, Lieuwe Sytse, 515. 

de Jonquières, Jean Philippe Ernest 
de Fauque, 465-466, 469, 477. 

de La Vallée Poussin, Charles Jean 
Gustave Nicolas, 381. 

de Lagny, Thomas Fantet, 279, 360. 

de Mairan, Jean-Jacques d’Ortous, 537. 

de Maupertuis, Pierre-Louis Moreau, 537. 

de Moivre, Abraham, 537. 

Debugging, 193, 221-223, 275, 331. 

Decimal computer: A computer that 
manipulates numbers primarily in 
the decimal (radix ten) number 
system, 21, 202-203. 

Decimal digits, 195, 319. 

Decimal fractions, history, 197-198, 326. 

Decimal number system, 197-199, 210, 
320-326, 374. 

Decimal point, 195. 

Decimation, 326, 328. 

Decision, unbiased, 2, 119-121. 

DECsystem 20 computer, 15. 

Decuple-precision floating point, 283. 


Dedekind, Julius Wilhelm Richard, 83, 687. 


sums, generalized, 83-92, 106. 

Definitely greater than, 224, 233-235, 
239, 242-243. 

Definitely less than, 224, 233-235, 
239, 242-243. 

Definition of randomness, 2, 149-183. 

Dégot, Jérôme, 683. 

Degree of a polynomial, 418, 420, 436. 

Degrees of freedom, 44, 495, 517-518, 704. 

Dekker, Theodorus Jozef, 242, 244, 253. 

Deléglise, Marc, 667. 

Dellac, Hippolyte, 465. 

DeMillo, Richard Allan, 675. 

Denneau, Monty Montague, 311. 

Density function, 124-126, 139. 

nearly linear, 126. 

Dependent normal deviates, 132, 139. 

Derandomization, 414. 

Derflinger, Gerhard, 138. 

Derivatives, 124, 439, 489, 524, 526, 537. 

Descartes, René, 407. 

Determinants, 356, 373, 432, 434, 
498-500, 523-524. 

Deviate: A random number. 

Devroye, Luc Piet-Jan Arthur, 138. 

Dewey, Melville (= Melvil) Louis Kossuth, 
notation for trees, 555. 

Diaconis, Persi Warren, 145, 263, 264, 622. 

Diamond, Harold George, 245. 

Dice, 2, 7, 42-43, 45-46, 58, 120-121, 190. 

Dickman, Karl Daniel, 382-383. 

Dickman-Golomb constant, 661. 

Dickson, Leonard Eugene, 287, 387, 646. 

Dictionaries, 201-202. 
Dieter, Ulrich Otto, 89, 91, 92, 101, 
114, 116, 119, 129-130, 132, 134, 
137, 573, 574, 588. 

Differences, 297-298, 504, 516. 

Differential equations, 526-527. 

Differentiation, see Derivatives. 

Diffie, Bailey Whitfield, 406. 

Digit: One of the symbols used in positional 
notation; usually a decimal digit, one 
of the symbols 0, 1, ..., or 9. 

binary, 195, 200. 
decimal, 195, 319. 
hexadecimal, 195, 210. 
octal, 210. 

Dilcher, Karl Heinrich, 403. 

Dilogarithm, 621. 

Diophantine equations, 343-345, 354, 
417, 449, 648. 

Diophantus of Alexandria (Διόφαντος 
ὁ Ἀλεξανδρεύς), see Diophantine 
equations. 

Direct product, 520, 522-523. 

Direct sum, 520, 522-523. 

conjecture, 708. 

Directed graph, 480-481, 484-485. 

Dirichlet, Johann Peter Gustav 
Lejeune, 342. 

series, 536-537, 695. 

Discrepancy, 39, 110-115. 

Discrete distribution functions, 48, 
120-121, 136-138. 

Discrete Fourier transforms, 169, 305-311, 
316-318, 501-503, 506, 512, 516, 
520-521, 524, 595. 

Discrete logarithms, 417. 

Discriminant of a polynomial, 674, 686. 

Distinct-degree factorization, 447-449, 
459, 689. 

Distribution: A specification of probabilities 
that govern the value of a random 
variable, 2, 119, 121. 

beta, 134-135. 

binomial, 136-138, 141, 401, 559. 

chi-square, 44, 48, 60, 69, 135, 590. 

exponential, 133, 137, 589. 

F-, 135. 

of floating point numbers, 253-264. 

gamma, 253-264. 

geometric, 136, 137, 140, 585. 

integer-valued, 136-141. 

Kolmogorov-Smirnov, 57-60. 

of leading digits, 254-264, 282, 404. 

negative binomial, 140. 

normal, 56, 122, 132, 139, 384, 565. 

partial quotients of regular continued 
fractions, 362-369, 665. 

Poisson, 55, 137-138, 140, 141, 538, 570. 

of prime factors, 382-384, 413. 

of prime numbers, 381-382, 405. 

Student’s, 135. 

t-, 135. 

tail of binomial, 167. 

tail of normal, 139. 

uniform, 2, 10, 48, 61, 119, 121, 124, 263. 

variance-ratio, 135. 

wedge-shaped, 125-126. 
Distribution functions, 48, 121, 140, 
263, 362, 382-384. 
continuous, 49, 53, 57, 60, 121-136. 
discrete, 48, 120-121, 136-138. 
empirical, 49. 
mixture of, 123-124, 138. 
polynomial, 138. 
product of, 121. 
Distributive laws, 231, 334, 418, 694. 
Divide-and-correct, 270-275, 278, 282-284. 
Divided differences, 504, 516. 
Dividend: The quantity u while computing 
⌊u/v⌋ and u mod v, 270. 
Division, 194, 265, 270-275, 278, 282-284, 
311-313. 
algebraic numbers, 333, 674. 
avoiding, 523-524. 
balanced ternary, 283. 
by ten, 321, 328. 
by zero, 220, 224, 241, 639. 
complex, 228, 283, 706. 
continued fractions, 649. 
double-precision, 251-252, 278-279. 
exact, 284. 
floating point, 220-221, 230-231, 243. 
fractions, 330. 
long, 270-275, 278, 282-284. 
mixed-radix, 209, 635. 
mod m, 354, 445, 499. 
multiprecision, 270-275, 278-279, 
282-283, 311-313. 
multiprecision by single-precision, 282. 
polynomial, 420-439, 487, 534. 
power series, 525-526, 533-534. 
pseudo-, 425-426, 435-436. 
quater-imaginary, 283. 
short, 282. 
string polynomials, 436-437. 
Divisor: The quantity v while computing 
⌊u/v⌋ and u mod v, 270. 
Divisor: x is a divisor of y if y mod x = 0 
and x > 0; it is a proper divisor if it is 
a divisor such that 1 < x < y. 
polynomial, 422. 
Dixon, John Douglas, 372, 401-402, 
412, 414, 415, 417. 
Dixon, Wilfrid Joseph, 565. 
Dobell, Alan Rodney, 17. 
Dobkin, David Paul, 697, 712. 
Dodgson, Charles Lutwidge, 435. 
Donsker, Monroe David, 559. 
Doob, Joseph Leo, 559. 
Dorn, William Schroeder, 488. 
Dot product, 36, 97, 173-174, 499-501. 
Double-precision arithmetic, 246-253, 
278-279, 295. 
Doubling, 322, 462. 
continued fraction, 375. 
Doubling step, 467. 
Downey, Peter James, 485. 
Dragon curve, 606, 609, 655. 


Dragon sequence, 655. 

Dresden, Arnold, 196. 

Drift, 237, 244. 

Du Shiran (杜石然), 287. 

Dual of an addition chain, 481, 484, 485. 
Duality formula, 569. 

Duality principle, 481, 485, 507, 535, 718. 
Dubner, Harvey Allen, 664. 

Duffield, Nicholas Geoffrey, 593. 

Dumas, Philippe, 355. 

Duncan, Robert Lee, 264. 

Duodecimal number system, 199-200. 
Dupré, Athanase, 653. 

Durbin, James, 57, 568. 

Durham, Stephen Daniel, 34. 
Durstenfeld, Richard, 145. 


e (base of natural logarithms), 12, 76, 
359, 726-727, 733. 
as “random” example, 21, 33, 47, 52, 108. 
Earle, John Goodell, 312. 
Eckhardt, Roger Charles, 189. 
L’Ecuyer, Pierre, 108, 179, 546, 582, 
584, 603. 
Edelman, Alan Stuart, 280. 
Edinburgh rainfall, 74. 
EDVAC computer, 225-226. 
Effective algorithms, 161-166, 169, 178. 
Effective information, 179. 
Egyptian mathematics, 335, 462. 
Eichenauer-Herrmann, Jürgen, 32, 558, 559. 
Eisenstein, Ferdinand Gotthold Max, 457. 
Electrologica X8 computer, 222. 
Electronic mail, 406. 
Elementary symmetric functions, 682—683. 
Elkies, Noam David, xi. 
Ellipsoid, 105. 
random point on, 141. 
Elliptic curve method, 402, 671. 
Elvenich, Hans-Michael, 409. 
Empirical distribution functions, 49. 
Empirical tests for randomness, 41, 61-80. 
Encoding a permutation, 65—66, 77—78, 145. 
Encoding secret messages, 193, 403—407, 
415, 417. 
Enflo, Per, 683. 
Engineering Research Associates, 208. 
Enhancing randomness, 26, 34. 
ENIAC computer, 54, 280. 
Entropy, 712. 
Enumerating binary trees, 527, 696, 723. 
Enumerating prime numbers, 382, 412, 416. 
Equality, approximate, 224, 233-235, 
239, 242-243, 245. 
essential, 233-235, 239-240, 242-244. 
Equidistributed sequence, 150, 163, 
177, 179-183. 
Equidistribution test, 61, 74, 75. 
Equitable distribution, 181. 
Equivalent addition chains, 480, 484. 


Eratosthenes of Cyrene (Ἐρατοσθένης 
ὁ Κυρηναῖος), 412. 

Erdős, Pál (= Paul), 181, 384, 471, 696. 

ERH, see Extended Riemann hypothesis. 

ERNIE machine, 3—4. 

Error, absolute, 240, 309, 312-313. 

Error, relative, 222, 229, 232, 253, 255. 

Error estimation, 222, 229, 232, 253, 
255, 309-310. 

Espelid, Terje Oskar, 616. 

Essential equality, 233-235, 239-240, 
242-244. 

Estes, Dennis Ray, 671. 

Estrin, Gerald, 488. 

Euclid (Εὐκλείδης), 335-337. 

Euclid’s algorithm, 86, 99, 102, 117, 184, 
288, 290, 304, 334-337, 340, 579. 

analysis of, 356-379. 

compared to binary algorithm, 341. 
extended, 342-343, 354, 379, 435—436, 534. 
for polynomials, 424, 438—439. 

for polynomials, extended, 437, 458. 

for string polynomials, 426—428. 
generalized to the hilt, 426-428. 
multiprecision, 345-348, 373. 

original form, 335-336. 

Eudoxus of Cnidus, son of Æschinus 
(Εὔδοξος Αἰσχίνου ὁ Κνίδιος). 

Euler, Leonhard (Ейлеръ, Леонардъ = 
Эйлер, Леонард), xi, 357, 375, 377, 
392, 407, 526, 649-651, 653, 655. 

constant γ, 359, 379, 726-727, 733. 
theorem, 20, 286, 548. 
totient function φ(n), 19-20, 289, 
369, 376, 583, 646. 

Eulerian numbers, 284. 

Evaluation: Computing the value. 
of determinants, 498-500, 523-524. 
of mean and standard deviation, 232, 244. 
of monomials, 485, 697. 
of polynomials, 485-524. 
of powers, 461—485. 

Eve, James, 493, 517. 

Eventually periodic sequence, 7, 22, 
375, 385. 

Exact division, 439. 

Excess q exponent, 214-215, 227, 246. 

Exclusive or, 31, 32, 193, 419. 

Exercises, notes on, ix—xi. 

Exhaustive search, 103. 

Exponent overflow, 217, 221, 227, 231, 
241, 243, 249. 

Exponent part of a floating point number, 
214-215, 246, 263, 283. 

Exponent underflow, 217, 221-222, 227, 
231, 241, 249. 

Exponential deviates, generating, 
132-133, 137. 

Exponential distribution, 133, 137, 589. 

Exponential function, 313, 490, 533, 537. 
Exponential sums, 84-85, 110-115, 181, 
305, 382, 501. 
Exponentiation: Raising to a power, 
461-485. 
multiprecision, 463. 
of polynomials, 463—464. 
of power series, 526, 537, 719. 
Extended arithmetic, 244-245, 639. 
Extended Euclidean algorithm, 342-343, 
354, 379, 435-436, 534. 
for polynomials, 437, 458. 
Extended Riemann Hypothesis, 
395-396, 671. 


F-distribution, 135. 

Factor: A quantity being multiplied. 

Factor method of exponentiation, 463, 

465, 482, 485. 

Factorial number system, 66, 209. 

Factorial powers, 297, 515, 534, 643, 731. 

Factorials, 416, 622. 

Factorization: Discovering factors. 
of integers, 13-14, 175, 379-394, 

396-403, 412-417. 
of polynomials, 439-461, 514. 
of polynomials mod p, 439-449, 455—456. 
of polynomials over the integers, 449-453. 
of polynomials over the rationals, 459. 
optimistic estimates of running time, 176. 
uniqueness of, 436. 

FADD (floating add), 223-224, 227-228, 

238, 253, 516. 

Fagin, Barry Steven, 403, 632. 

Fallacious reasoning, 26, 74, 88, 95-96, 222. 

Falling powers, 297, 731. 

Fan, Chung Teh (范崇德), 143. 

Fast Fourier transform, 73, 306-310, 

318, 502, 505, 512, 516, 521, 706, 
710, 713-714. 
history of, 701. 

Fateman, Richard J, 463. 

Faure, Henri, 164. 

FCMP (floating compare), 223, 244. 

FDIV (floating divide), 223. 

Feijen, Wilhelmus (= Wim) Hendricus 

Johannes, 636. 

Ferguson, Donald Fraser, 280. 

Fermat, Pierre de, 386-388, 391, 407, 579. 
factorization method, 386-391, 412. 
numbers, 14, 386, 397, 403. 
theorems, 391, 413, 440, 579, 680. 

Ferranti Mark I computer, 3, 192. 

Ferrenberg, Alan Milton, 188. 

FFT, 516, see Fast Fourier transform. 

Fibonacci, Leonardo, of Pisa (= Leonardo 

filio Bonacii Pisano), 197, 208, 280. 
generator, 27, 34, 36, 37, 47, 52, 54, 92. 
number system, 209. 
numbers Fn: Elements of the Fibonacci 

sequence, 731. 
numbers, table of, 728. 
sequence, 27, 37, 213, 264, 360, 468, 

483, 660, 666. 
sequence, lagged, 27—29, 35, 40, 72, 75, 

79-80, 146, 186-188, 193. 
Field: An algebraic system admitting 
addition, subtraction, multiplication, 
and division, 213, 331, 420, 422, 
506, 525. 
finite, 29-30, 449, 457, 554-555, 702. 
Fike, Charles Theodore, 490. 
Finck, Pierre Joseph Etienne, 360. 
Findley, Josh Ryan, 409. 
Fine, Nathan Jacob, 90. 
Finetti, Bruno de, 566. 
Finite fields, 29-30, 449, 457, 554-555, 702. 
Finite Fourier transform, see Discrete 
Fourier transform. 
Finite sequences, random, 167—170, 178. 
Fischer, Michael John, 634. 
Fischer, Patrick Carl, 241. 
Fischlin, Roger, 669. 
Fisher, Ronald Aylmer, 145. 
Fishman, George Samuel, 108. 
FIX (convert to fixed point), 224. 
Fix-to-float conversion, 221, 223-224. 
Fixed point arithmetic, 214, 225-226, 
308-310, 532. 
Fixed slash arithmetic, 331-333, 379. 
Flajolet, Philippe Patrick Michel, 355, 
366, 449, 541, 644. 
Flammenkamp, Achim, 478, 483, 693. 
Flat distribution, see Uniform distribution. 
Flehinger, Betty Jeanne, 262. 
Float-to-fix conversion, 224-225, 228. 
Floating binary numbers, 214, 225, 
227, 254, 263. 
Floating decimal numbers, 214, 226, 
254-264. 
Floating hexadecimal numbers, 254, 263. 
Floating point arithmetic, 36, 188, 193, 
196, 214-264, 292. 
accuracy of, 222, 229-245, 253, 329, 
438, 485. 
addition, 215-220, 227—228, 230-231, 
235-238, 253-254, 602. 
addition, exact, 236. 
axioms, 230-231, 242-245. 
comparison, 233-235, 239, 242-243. 
decuple-precision, 283. 
division, 220-221, 230-231, 243. 
double-precision, 246-253, 278-279. 
hardware, 223-226. 




intervals, 228, 240-242, 244-245, 333, 613. 


mod, 228, 243, 244. 

multiplication, 220, 230-231, 243, 
263-264. 

multiplication, exact, 244. 

operators of MIX, 215, 223-225, 516. 

quadruple-precision, 253. 

reciprocal, 243, 245, 263. 

single-precision, 214-228. 

subtraction, 216, 230-231, 235-238, 
245, 253, 556, 602. 

summation, 232, 244. 

triple-precision, 252. 

unnormalized, 238-240, 244, 327. 


Floating point numbers, 196, 214-215, 

222, 228, 246. 
radix-b, excess-q, 214-215. 
statistical distribution, 253-264. 
two’s complement, 228. 

Floating point radix conversion, 326-329. 

Floating point trigonometric subroutines, 
245, 247, 490. 

Floating slash arithmetic, 331, 333. 

Floor function ⌊x⌋, 81, 732. 

FLOT (convert to floating point), 223. 

Floyd, Robert W, 7, 148, 280, 361, 505, 540. 

Fluhrer, Scott, 542. 

FMUL (floating multiply), 223, 516. 

Foata, Dominique Cyprien, 9. 

FOCS: Proceedings of the IEEE Symposia 
on Foundations of Computer Science 
(1975-), formerly called the Symposia 
on Switching Circuit Theory and 
Logic Design (1960-1965), Symposia 
on Switching and Automata Theory 
(1966-1974). 

Forsythe, George Elmer, 4, 128. 

FORTRAN language, 188, 193, 279, 600, 602. 

Fourier, Jean Baptiste Joseph, 278. 

division method, 278. 

series, 90, 487. 

transform, discrete, 169, 305-311, 
316-318, 501-503, 506, 512, 516, 
520-521, 524, 595. 

Fowler, Thomas, 208. 

Fractals, 206. 

Fraction overflow, 217, 254, 262, 264. 

Fraction part of a floating point number, 
214-215, 246, 263. 

distribution of, 254—264. 

Fractions: Numbers in [0..1), 36. 

conversion, 319-328. 

decimal, history, 197—198, 326. 
exponentiation, 483. 

random, see Uniform deviates. 
terminating, 328. 

Fractions: Rational numbers, 330-333, 
420, 526. 

Fraenkel, Aviezri S (אביעזרי פרנקל), 290, 
291, 292, 630. 

Franel, Jérôme, 258. 

Franklin, Joel Nick, 149, 158, 159, 177, 
180, 182, 577. 

Franta, William Ray, 60. 

Fredricksen, Harold Marvin, 557. 

Free associative algebra, 437. 

Frequency function, see Density function. 

Frequency test, 61, 74, 75. 

Friedland, Paul, 613. 

Frieze, Alan Michael, 599. 

Fritz, Kurt von, 335. 

Frobenius, Ferdinand Georg, 539, 681, 689. 

automorphism, 689. 

Frye, Roger Edward, 538. 

FSUB (floating subtract), 223, 253. 

Fuchs, Aimé, 9. 

Fundamental theorem of arithmetic, 

334, 422, 483. 

Fuss, Paul Heinrich von (Фусс, Павел 
Николаевич), 392, 651. 


Gage, Paul Vincent, 409. 

Galambos, János, 661. 

Galois, Évariste, 449, 457. 

fields, see Finite fields. 
groups, 679, 681, 689, 690. 

Gambling systems, 161. 

Gamma distribution, 133-134, 140. 

Gamma function, incomplete, 56, 59, 133. 

Ganz, Jürg Werner, 707. 

Gap test, 62-63, 74-76, 136, 158, 180. 

Gardner, Martin, 41, 200, 280, 592. 

Garner, Harvey Louis, 280, 290, 292. 

Gathen, Joachim Paul Rudolf von zur, 
449, 611, 673, 687. 

Gauß (= Gauss), Johann Friderich Carl 
(= Carl Friedrich), 20, 101, 363, 417, 
422, 449, 578, 679, 685, 688, 701. 

integers, 292, 345, 579. 
lemma about polynomials, 422—423, 682. 

Gay, John, 1. 

gcd: Greatest common divisor. 

Gebhardt, Friedrich, 34. 

Geiger, Hans, counter, 7. 

Geiringer, Hilda, von Mises, 76. 

Generalized Dedekind sums, 83-92, 106. 

Generalized Riemann hypothesis, 396. 

Generating functions, 140, 147, 213, 261, 
276-278, 525, 562-563, 679-680, 
686, 695. 

Generation of uniform deviates, 10—40, 
184-189, 193. 

Genuys, François, 280. 

Geometric distribution, 136, 137, 140, 585. 

Geometric mean, 283. 

Geometric series, 84, 307, 519, 700. 

Gerhardt, Carl Immanuel, 200. 

Gerke, Friedrich Clemens, 653. 

Gessel, Ira Martin, 723. 

Gibb, Allan, 242. 

Gilbert, William John, 607. 

Gill, Stanley, 226. 

Gimeno Fortea, Pedro, 187. 

GIMPS project, 409. 

Gioia, Anthony Alfred, 469. 

Girard, Albert, 424. 

Givens, James Wallace, Jr., 94. 

Glaser, Anton, 201. 

Globally nonrandom behavior, 51-52, 80. 

Goertzel, Gerald, 487. 

Goffinet, Daniel, 607. 

Goldbach, Christian, 392, 651. 

Goldberg, David Marc, 226. 

Golden ratio, 164, 283, 359, 360, 514, 
652, 726-727, 733. 

Goldreich, Oded (עודד גולדרייך), 179, 
598, 669. 

Goldschmidt, Robert Elliott, 312. 

Goldstein, Daniel John, 593. 

Goldstine, Herman Heine, 202, 278, 327. 

Goldwasser, Shafrira, 598. 

Golomb, Solomon Wolf, 147, 661, 711. 
Golub, Gene Howard, 562. 
Gonzalez, Teofilo, 60. 
Good, Irving John, 183. 
Goodman, Allan Sheldon, 108. 
Gosper, Ralph William, Jr., 101, 107, 117, 
355, 375, 378, 540, 649. 
Gosset, William Sealy (= Student), 
t-distribution, 135. 
Goulard, Achille, 477. 
Gould, Henry Wadsworth, 725. 
Gourdon, Xavier Richard, 355, 449, 667. 
Goyal, Girish Kumar (गिरीश कुमार गोयल), 625. 
Gradual underflow, 222. 
Gräffe, Carl Heinrich, 683. 
Graham, Ronald Lewis, 484, 608, 741. 
Gram, Jørgen Pedersen, 674. 
Gram-Schmidt orthogonalization 
process, 101, 674. 
Granville, Andrew James, 396, 659. 
Graph, 480-481, 485. 
Graphics, 284. 
Gray, Frank, binary code, 209, 568, 699. 
Gray, Herbert Lee Roy, 242. 
Gray levels, multiplication of, 284. 
Great Internet Mersenne Prime Search, 409. 
Greater than, definitely, 224, 233-235, 
239, 242-243. 
Greatest common divisor, 330-356, 483. 
binary algorithms for, 338-341, 
348-356, 435. 
Euclidean algorithm for, see Euclid’s 
algorithm. 
multiprecision, 345-348, 354, 355, 
379, 656. 
of n numbers, 341, 378. 
of polynomials, 424—439, 460, 453-455. 
within a unique factorization domain, 424. 
Greatest common right divisor, 437—438. 
Greedy algorithm, 293. 
Greek mathematics, 196-197, 225, 
335-337, 359. 
Green, Bert Franklin, Jr., 27. 
Greenberger, Martin, 17, 88, 551. 
Greene, Daniel Hill, 659, 686, 697. 
Greenwood, Joseph Arthur, 33. 
Greenwood, Robert Ewing, 74. 
Gregory, Robert Todd, 657. 
GRH: The ERH for algebraic numbers, 396. 
Groote, Hans Friedrich de, 708. 
Grosswald, Emil, 90. 
Grotefeld, Andreas Friedrich Wilhelm, 656. 
Groups, 701. 
Galois, 679, 681, 689, 690. 
Grube, Andreas, 582. 
Grünwald, Vittorio, 204, 205. 
Guaranteed randomness, 35-36, 170-176. 
Guard digits, 227. 
Gudenberg, see Wolff von Gudenberg. 
Guessing, amplified, 172-174, 416-417. 
Guilloud, Jean, 280. 
Gustavson, Fred Gehrung, 721. 
Guy, Michael John Thirian, 623. 
Guy, Richard Kenneth, 402, 413. 
Haber, Seymour, 164. 
Habicht, Walter, 435. 
Hadamard, Jacques Salomon, 382, 432. 
inequality, 432, 436, 499. 
transform, 173, 502. 
Hafner, James Lee, 661. 
Hajjar, Mansour (منصور حجار), 29. 
Hajratwala, Nayan (નયન હજરતવાલા), 409. 
HAKMEM, 540, 649. 
Halberstam, Heini, 664. 
Halewyn, Christopher Neil van, 403. 
Halliwell-Phillipps, James Orchard, 316. 
Halton, John Henry, 164. 
Halving, 328, 338, 339, 462. 
continued fraction, 375. 
modular, 293. 
Hamblin, Charles Leonard, 420. 
Hamlet, Prince of Denmark, v. 
Hammersley, John Michael, 189. 
Hamming, Richard Wesley, 255, 263. 
Handscomb, David Christopher, 189. 
Handy identities, 628-629. 
Hansen, Eldon Robert, 617. 
Hansen, Walter, 473, 475, 476, 479, 483. 
Hanson, Richard Joseph, 615. 
Haralambous, Yannis (Χαραλάμπους, 
Ἰωάννης), 764. 
Hard-core bit, 179. 
Hardware: Computer circuitry. 
algorithms suitable for, 228, 
244 (exercise 17), 280, 312, 313-316, 
322, 327-329, 338, 356, 461, 695. 
Hardy, Godfrey Harold, 382, 384, 653. 
Harmonic numbers Hn, 731. 
fractional, 362, 729. 
table of, 728-729. 
Harmonic probability, 264. 
Harmonic sums, 355. 
Harriot, Thomas, 199. 
Harris, Bernard, 541. 
Harris, Vincent Crockett, 341, 355. 
Harrison, Charles, Jr., 242. 
Hashing, 70, 148. 
Hassler, Hannes, 699. 
Håstad, Johan Torkel, 179, 417, 514, 599. 
Haynes, Charles Edmund, Jr., 108. 
hcf, see Greatest common divisor. 
Hebb, Kevin Ralph, 477. 
Heideman, Michael Thomas, 701. 
Heilbronn, Hans Arnold, 372, 377. 
Heindel, Lee Edward, 677. 
Hellman, Martin Edward, 406. 
Henrici, Peter Karl Eugen, 332, 505. 
Hensel, Kurt Wilhelm Sebastian, 213, 685. 
lemma, 38, 451, 454-455, 458, 685-686. 
Hensley, Douglas Austin, 366, 373. 
Heringa, Jouke Reyn, 29. 
Hermelink, Heinrich, 208. 
Hermite, Charles, 115, 579. 
Herrmann, Hans Jürgen, 29. 
Hershberger, John Edward, 366. 


Herzog, Thomas Nelson, 178, 594, 598. 
Hexadecimal digits, 195, 210. 
Hexadecimal number system, 201—202, 
204, 210, 324, 639. 
floating point, 254, 263. 
nomenclature for, 201. 
Higham, Nicholas John, 242. 
Hilferty, Margaret Mary, 134. 
Hill, Ian David, 544. 
Hill, Theodore Preston, 262. 
himult, 15, 584. 
Hindu mathematics, 197, 208, 209, 281, 
287, 343, 387, 461, 648. 
HITACHI SR2201 computer, 280. 
Hitchcock, Frank Lauren, 506. 
Hlawka, Edmund, 117. 
HLT (halt), 222. 
Hobby, John Douglas, 764. 
Hoffmann, Immanuel Carl Volkmar, 279. 
Hofstadter, Douglas Richard, 330. 
Holte, John Myrom, 629. 
Homann, Karsten, 736. 
Homogeneous polynomial, 437, 458, 698. 
Hopcroft, John Edward, 500, 507, 699. 
Hörmann, Wolfgang, 138. 
Horner, Horst Helmut, 118. 
Horner, William George, 486. 
rule for polynomial evaluation, 486—489, 
498, 504, 515, 517, 519. 
Horowitz, Ellis, 505. 
Howard, John Vernon, 178. 
Howell, Thomas David, 708. 
Hoyle, Edmond, rules, 147. 
Huff, Darrell Burton, 42. 
Hull, Thomas Edward, 17. 
Hurwitz, Adolf, 345, 375, 376, 649. 
Huygens (= Huyghens), Christiaan, 655. 
Hyde, John Porter, 419. 
Hyperbolic tangent, 375. 
Hyperplanes, 96, 97, 116. 


IBM 704 computer, 280. 

IBM 7090 computer, 280. 

IBM System/360 computers, 396-397, 614. 

IBM System/370 computers, 15. 

Ibn Ezra (= Ben Ezra), Abraham ben Meir 
(אברהם בן מאיר אבן עזרא), also known as 
Abu Ishaq Ibrahim al-Majid 
(أبو إسحق إبراهيم الماجد بن عزرا), 197. 

Idempotent, 539, 694. 

Identity element, 418. 

IEEE standard floating point, 226, 246, 602. 

Ikebe, Yasuhiko (池辺八洲彦), 252. 

Ill-conditioned matrix, 292. 

Images, digitized, 284. 

Imaginary radix, 205-206, 209-210, 283. 

Impagliazzo, Russell Graham, 179. 

Improving randomness, 26, 34. 

IMSL: The International Mathematics and 
Statistics Library, 108. 

in situ transformation, 700. 


Inclusion and exclusion principle, 354, 
563, 610, 640, 678, 699. 
Incomplete gamma function, 56, 59, 133. 
Increment in a linear congruential sequence, 
10-11, 17, 22, 89, 97, 185. 
Independence, algebraic, 496, 518. 


Independence, linear, 443—444, 508, 659-660. 


Independence of random numbers, 2, 43, 
46, 55, 59, 66, 95, 240, 559, 562. 

Index modulo p, 417. 

Indian mathematics, 197, 208, 209, 281, 
287, 343, 387, 461, 648. 

Induced functions, 535. 

Induction, mathematical, 336. 

on the course of computation, 266, 

269, 337. 

Inductive assertions, 281—282. 

Infinite continued fractions, 358-359, 374. 


Infinity, representation of, 225, 244—245, 332. 


Inner product, 97, 499-501, 520. 
Integer, random, 
among all positive integers, 257, 
264, 342, 354. 
in a bounded set, 119-121, 138, 185-186. 
Integer solution to equations, 343-345, 
354, 417, 449, 648. 
Integer-valued distributions, 136-141. 
Integrated circuit module, 313. 
Integration, 153-154, 259. 
Interesting point, 642. 
Internet, iv, x. 
Interpolation, 297, 365, 503-505, 509, 
516, 700, 721. 
Interpretive routines, 226. 
Interval arithmetic, 228, 240-242, 
244-245, 333, 613. 
Inverse Fourier transform, 307, 316, 
516, 633. 
Inverse function, 121, see also Reversion 
of power series. 
Inverse matrix, 98, 331, 500, 524. 
Inverse modulo 2^e, 213, 629. 
Inverse modulo m, 26, 354, 445, 456, 646. 
Inversive congruential sequence, 32-33, 40. 
Irrational numbers: Real numbers that 
are not rational, 181, 359. 
multiples of, mod 1, 164, 379, 622. 
transcendental, 378. 
Irrational radix, 209. 
Irrationality, quadratic, 358, 374-375, 
397-401, 412, 415, 665. 
Irreducible polynomial, 422, 435, 450, 
456-457, 460. 
Ishibashi, Yoshihiro (石橋善弘), 291. 
Islamic mathematics, 197, 280-281, 
326, 461—462. 
Iteration of power series, 530-536, 723. 
Iterative n-source, 172. 
Iverson, Kenneth Eugene, 226. 
Jabotinsky, Eri, 533, 536, 723. 

JACM: Journal of the ACM, a publication 
of the Association for Computing 
Machinery since 1954. 

Jacobi, Carl Gustav Jacob, 662. 

symbol, 413-414, 415, 655, 662, 668. 

JAE (jump A even), 339, 481. 

Jaeschke, Gerhard Paul Werner, 666. 

Jager, Hendrik, 665. 

Ja’Ja’ (= JaJa), Joseph Farid 
(جوزيف فريد جعجع), 514. 

Janssens, Frank, 107, 114. 

Jansson, Birger, 540, 553. 

JAO (jump A odd), 339, 612. 

Japanese mathematics, 648. 

Jayadeva, Ācārya (आचार्य जयदेव), 648. 

Jebelean, Tudor, 629. 

Jefferson, Thomas, 229. 

Jensen, Geraldine Afton, 466. 

Jensen, Johan Ludvig William Valdemar, 
683. 

Jevons, William Stanley, 388. 

Jiuzhang Suanshu (九章算術), 340. 

Jöhnk, Max Detlev, 134. 

Johnson, Don Herrick, 701. 

Johnson, Jeremy Russell, 625. 

Johnson, Samuel, 229. 

Jokes, 3, 417. 

Jones, Hugh, 200, 326. 

Jones, Terence Gordon, 143. 

Jong, Lieuwe Sytse de, 515. 

Jonquières, Jean Philippe Ernest de Fauque 
de, 465-466, 469, 477. 

Jordaine, Joshua, 199. 

Judd, John Stephen, 394. 

Jurkat, Wolfgang Bernhard, 699. 

Justeson, John Stephen, 196. 

JXE (jump X even), 339. 

JXO (jump X odd), 219, 339. 


k-distributed sequence, 151-155, 168, 
177, 179-182. 

Kac, Mark, 384. 

Kahan, William Morton, 222, 226, 227, 
241-245, 617. 

summation formula, 615. 

Kaib, Michael Andreas, 578. 

Kaltofen, Erich Leo, 345, 449, 455, 672, 718. 

Kaminski, Michael, 712. 

Kanada, Yasumasa (金田康正), 280. 

Kankaala, Kari Veli Antero, 75, 570. 

Kannan, Ravindran (ரவீந்திரன் கண்ணன்), 599. 

Kanner, Herbert, 327. 

Karatsuba, Anatolii Alekseevich (Карацуба, 
Анатолий Алексеевич), 295, 
318, 420, 663. 

Karlsruhe, University of, 242. 

Kátai, Imre, 607. 

Katz, Victor Joseph, 198. 

Kayal, Neeraj (नीरज कयाल), 396. 

Keir, Roy Alex, 237, 638. 
Keller, Wilfrid, 664, 666. 

Kempner, Aubrey John, 204, 378. 

Kendall, Maurice George, 3, 74, 76. 

Kermack, William Ogilvy, 74. 

Kerr, Leslie Robert, 699. 

Kesner, Oliver, 226. 

Khinchin, Alexander Yakovlevich (Хинчин, 
Александр Яковлевич), 356, 652. 

Killingbeck, Lynn Carl, 103, 107. 

Kinderman, Albert John, 130-131, 135. 

Klarner, David Anthony, 213. 

Klem, Laura, 27. 

Knop, Robert Edward, 136. 

Knopfmacher, Arnold, 345, 686. 

Knopfmacher, John Peter Louis, 345. 

Knopp, Konrad Hermann Theodor, 364. 

Knorr, Wilbur Richard, 335. 

Knott, Cargill Gilston, 627. 

Knuth, Donald Ervin (高德纳), ii, iv, vii, 2, 
4, 30, 89, 138, 145, 159, 189, 196, 205, 
226, 242, 316, 335, 373, 378, 384, 435, 
491, 584, 595, 599, 606, 636, 659, 661, 
686, 694, 697, 722, 741, 764. 

Knuth, Jennifer Sierra (高小珍), xiv. 

Knuth, John Martin (高小强), xiv. 

Kohavi, Zvi (צבי כוכבי), 498. 

Koksma, Jurjen Ferdinand, 161. 

Kolmogorov, Andrei Nikolaevich 
(Колмогоров, Андрей Николаевич), 
56, 169, 178, 183. 

Kolmogorov—Smirnov distribution, 57—60. 

table, 51. 

Kolmogorov—Smirnov test, 48—60. 

Kondo, Shigeru (近藤茂), 280. 
Kontorovich, Alex Vladimir (Конторович, 
Александр Владимирович), 584. 

Koons, Florence, 327. 

Kornerup, Peter, 332-333, 629, 657. 

Korobov, Nikolai Mikhailovich (Коробов, 
Николай Михайлович), 114, 159, 177. 

Kovács, Béla, 607. 

Kraitchik, Maurice Borisovitch (Крайчик, 
Меер Борисович), 396, 407. 

Krandick, Werner, 625, 629. 

Krishnamurthy, Edayathumangalam 
Venkataraman (எடயாத்துமங்கலம் 
வெங்கடராமன் கிருஷ்ணமூர்த்தி), 
278, 279. 

Kronecker, Leopold, 450, 678, 688, 730. 

Kruskal, Martin David, 542. 

KS test, see Kolmogorov—Smirnov test. 

Kuczma, Marek, 533. 

Kuipers, Lauwerens, 114, 177. 

Kulisch, Ulrich Walter Heinz, 242, 245. 

Kung, Hsiang Tsung (孔祥重), 356, 
529-530, 533, 720. 

Kurita, Yoshiharu (栗田良春), 29, 572, 604. 

Kurowski, Scott James, 409. 

Kuṭṭaka (कुट्टक), 287, 343. 

Kuz’min, Rodion Osievich (Кузьмин, 
Родион Осиевич), 363. 


l^0-chain, 479, 483, 485. 

L³ algorithm, 118, 417, 453. 

La Touche, Maria Price, 194, 230. 

La Vallée Poussin, Charles Jean Gustave 
Nicolas de, 381. 

Laderman, Julian David, 700. 

Lagarias, Jeffrey Clark, 416, 599, 667. 
Lagged Fibonacci sequences, 27—29, 35, 40, 
72, 75, 79-80, 146, 186-188, 193. 

Lagny, Thomas Fantet de, 279, 360. 

Lagrange (= de la Grange), Joseph Louis, 
Comte, 375, 378, 456, 503, 527, 533, 
549, 649, 653, 655. 

interpolation formula, 503-505. 
inversion formula, 527-528, 533-534, 723. 

Lags, 28. 

Lake, George Thomas, 327. 

Lakshman, Yagati Narayana 
(యాగటి నారాయణ లక్ష్మణ్), 455. 

Lalanne, Léon Louis Chrétien, 208. 

Lalescu, Gheorghe Liviu, 186. 

Lamé, Gabriel, 360. 

Landau, Edmund Georg Hermann, 621, 683. 

Laplace (= de la Place), Pierre Simon, 
Marquis de, 363. 

Lapko, Olga Georgievna (Лапко, Ольга 
Георгиевна), 764. 

Large prime numbers, 407—412, 549-550, 
663-664. 

Las Vegas algorithms: Computational 
methods that use random numbers and 
always produce the correct answer if 
they terminate, 447—449, 459, 681-682. 

Lattice of points, 97. 

Lattice reduction, see Short vectors. 

Laughlin, Harry Hamilton, 279. 

Laurent, Paul Mathieu Hermann, series, 723. 

Lauwerier, Hendrik Adolf, 561. 

Lavaux, Michel, 107. 

Lavington, Simon Hugh, 3. 

Lawrence, Frederick William, 390. 

lcm: Least common multiple. 

Leading coefficient, 418, 451—452, 454. 

Leading digit, 195, 239. 

Leading zeros, 222, 238-240, 327. 

Least common left multiple, 437—438. 

Least common multiple, 18, 23, 292, 334, 
337, 353, 411, 483, 641. 

Least remainder algorithm, 376. 

Least significant digit, 195. 

Lebesgue, Henri Léon, measure, 160, 
166-167, 178, 367. 

Lebesgue (= Le Besgue), Victor Amédée, 
341, 662. 

L’Ecuyer, Pierre, 108, 179, 546, 582, 

584, 603. 

Leeb, Hannes, 604. 

Leeuwen, Jan van, 477, 515, 706. 

Left multiple, least common, 437—438. 


Legendre (= Le Gendre), Adrien Marie, 
326-327, 381, 396, 449. 
symbol, 414. 
Léger, Émile, 360. 
Léger, Roger, 587. 
Lehman, Russell Sherman, 387, 405. 
Lehmer, Derrick Henry, 10-11, 47, 54, 149, 
278, 345-346, 382, 390, 391, 394, 396, 
409, 413, 414, 484, 655, 660, 667, 686. 
Lehmer, Derrick Norman, 278, 661. 
Lehmer, Emma Markovna Trotskaia, 391. 
Lehn, Jürgen, 32, 558. 
Leibniz, Gottfried Wilhelm, Freiherr 
von, 200. 
Lempel, Abraham, 556, 712. 
Lenstra, Arjen Klaas, 118, 403, 417, 
453, 712. 
Lenstra, Hendrik Willem, Jr., 118, 396, 
402-403, 416, 417, 453, 656. 
Leonardo Pisano, see Fibonacci. 
Leong, Benton Lau (梁捷安), 485. 
Leslie, John, 208. 
Less than, definitely, 224, 233-235, 
239, 242-243. 
Leva, Joseph Leon, 132. 
Levene, Howard, 74. 
LeVeque, William Judson, 648. 
Levin, Leonid Anatolievich (Левин, Леонид 
Анатольевич), 36, 170, 179. 
Levine, Kenneth Allan, 104. 
Lévy, Paul, 363. 
Levy, Silvio Vieira Ferreira, vii. 
Lewis, John Gregg, 615. 
Lewis, Peter Adrian Walter, 108, 701. 
Lewis, Theodore Gyle, 32. 
Lexicographic order, 207, 624. 
li: Logarithmic integral function. 
Li, Ming (李明), 179. 
Li Yan (李儼), 287. 
Lickteig, Thomas Michael, 706. 
Lindholm, James H., 79. 
Linear congruential sequence, 10—26, 
145-146, 184-186, 193. 
choice of increment, 10-11, 17, 22, 
89, 97, 185. 
choice of modulus, 12-16, 23, 184. 
choice of multiplier, 16-26, 88-89, 
105-109, 184-185. 
choice of seed, 17, 20, 143, 184. 
period length, 16-23. 
subsequence of, 11, 73. 
Linear equations, 291-292. 
integer solution to, 343-345, 354. 
Linear factors mod p, 455. 
Linear iterative array, 313-317, 328. 
Linear lists, 279, 281, 283. 
Linear operators, 363-366, 376. 
Linear probing, 592. 
Linear recurrences, 29-32, 409-411, 695. 
mod m, 37—40. 
Linearly independent vectors, 443-444, 
508, 659-660. 
Linked memory, 279, 281, 283, 419. 
Linking automaton, 311. 
Linnainmaa, Seppo Ilmari, 242, 244, 718. 
Liouville, Joseph, 378. 
Lipton, Richard Jay, 497, 675, 697. 
Liquid measure, 199. 
Little Fermat computer, 311. 
Littlewood, John Edensor, 382. 
LLL algorithm, 118, 417, 453. 
Local arithmetic, 200. 
Locally nonrandom behavior, 46, 
51-52, 152, 168. 
Lochs, Gustav, 372-373. 
Loewenthal, Dan (דן לבנטל), 291. 
Logarithm, 279, 313. 
discrete, 417. 
of a matrix, 536. 
of a power series, 533, 537. 
of a uniform deviate, 133. 
of φ, 283. 
Logarithmic integral, 381-382, 414, 663. 
Logarithmic law of leading digits, 
254-264, 282, 404. 
Logarithmic sums, 628-629. 
Logical operations, see Boolean operations. 
Löh, Günter, 666. 
lomult, 15. 
Long division, 270-275, 278-279. 
Loos, Rüdiger Georg Konrad, 435, 674. 
Lotti, Grazia, 500, 715. 
Lovász, László, 118, 417, 453. 
Lovelace, Augusta Ada Byron King, 
Countess of, 189. 
Loveland, Donald William, 178, 179, 183. 
Lubiw, Anna, 656. 
Lubkin, Samuel, 327. 
Luby, Michael George, 179. 
Lucas, François Edouard Anatole, 391, 
407, 409, 413, 414. 
numbers Ln, 695. 
Lukes, Richard Francis, 390. 
Lund, Carsten, 593. 
Lüscher, Martin, 35, 72, 109, 188, 
550, 556, 571. 
Luther, Herbert Adesla, 278. 


m-ary method of exponentiation, 464, 466, 
470-471, 481-482. 

Ma, Keju (马柯驹), 673. 
Machine language versus higher-level 
languages, 16, 185. 
MacLaren, Malcolm Donald, 33, 47, 
128, 551, 585. 
MacMahon, Percy Alexander, 609. 
MacMillan, Donald Bashford, 226. 
MacPherson, Robert Duncan, 114. 
MacSorley, Olin Lowe, 280. 
Maeder, Roman Erich, 627, 635. 
Mahler, Kurt, 180. 

measure, 683. 
Maiorana, James Anthony, 557. 

Mairan, Jean-Jacques d’Ortous de, 537. 
Makarov, Oleg Mikhailovich (Макаров, 
Олег Михайлович), 700, 714. 
Mallows, Colin Lingwood, 74. 

Manasse, Mark Steven, 403. 

Manchester University Computer, 192. 
Mandelbrot, Benoit Baruch, 606. 
Mangoldt, Hans Carl Friedrich von, 663. 
function, 371, 376. 

MANIAC III computer, 242. 

Mansour, Yishay (ישי מנצור), 316. 
Mantel, Willem, 552. 

Mantissa, 214, see Fraction part. 
Marczyński, Romuald Władysław, 205. 
Mariage, Aimé, 201. 

Mark I computer (Ferranti), 3. 

Mark II Calculator (Harvard), 225. 
Marsaglia, George, 3, 23, 29, 33, 40, 47, 
62, 71, 72, 75, 78, 108, 114-115, 119, 
122, 123, 128, 133-135, 179, 544, 
546-547, 549, 551, 565, 588. 
Martin, Monroe Harnish, 32, 38, 40. 
Martin-Löf, Per Erik Rutger, 169-170, 178. 
Masking, 322, 328-329, 389-390, 671. 


Math. Comp.: Mathematics of Computation 


(1960—), a publication of the American 
Mathematical Society since 1965; 
founded by the National Research 
Council of the National Academy 

of Sciences under the original title 
Mathematical Tables and Other Aids 
to Computation (1943-1959). 

Mathematical aesthetics, 289. 

Matias, Yossi (יוסי מטיאס), 121. 

Matrix: A rectangular array, 486. 
characteristic polynomial, 499, 524. 
determinant, 356, 373, 432, 434, 

498-500, 523-524. 
greatest common right divisor, 438. 
inverse, 98, 331, 500, 524. 
multiplication, 499-501, 506-507, 
516, 520-523, 699. 
null space, 443-444, 456, 659-660, 681. 
permanent, 499, 515-516. 
rank, 443—444, 506, 508, 520. 
semidefinite, 586. 
singular, 98, 116, 513, 520. 
triangularization, 444, 659-660, 677. 

Matrix (Bush), Irving Joshua, 41, 280. 

Matsumoto, Makoto (松本眞), 29, 572, 604. 

Matthew, Saint (Ἅγιος Ματθαῖος 
ὁ Εὐαγγελιστής), 735. 

Matula, David William, 210, 211, 329, 

332-333, 379. 

Mauchly, John William, 27. 

Maupertuis, Pierre-Louis Moreau de, 537. 

Maximum of random deviates, 122. 

Maximum-of-t test, 52, 54, 59, 70, 75, 

77, 122, 158, 180. 

Maya Indians, 196. 


Mayer, Dieter Heinz-Jörg, 366. 

Maze, Gérard, 645. 

McCarthy, Daniel Patrick, 696. 
McClellan, Michael Terence, 292. 
McCracken, Daniel Delbert, 226. 
McCurley, Kevin Snow, 661, 671. 
McEliece, Robert James, 687. 
McKendrick, Anderson Gray, 74. 

Mean, evaluation of, 232, 244. 

Measure, units of, 198-199, 201, 209, 
255, 326, 327. 

Measure theory, 160, 166-167, 178, 367. 
Mediant rounding, 331-332, 379. 
Meissel, Daniel Friedrich Ernst, 667. 
Mellin, Robert Hjalmar, transforms, 
355, 644. 

Mendelsohn, Nathan Saul, 211. 

Mendès France, Michel, 649, 656. 
Mental arithmetic, 279, 295. 

Merit, figure of, 105. 

Mersenne, Marin, 391, 407. 
multiplication, 294. 

numbers, 14, 409. 

primes, 185, 409, 412, 413. 

Mertens, Franz Carl Joseph, 641, 659. 
constant, 659. 

METAFONT, iv, vi, 764. 

METAPOST, vii, 764. 

Metrology, 201. 

Metropolis, Nicholas Constantine 
(Μητρόπολης, Νικόλαος Κωνσταντίνου), 
4, 189, 240, 242, 327. 

Metze, Gernot, 280. 

Meyer, Albert Ronald da Silva, 634. 
Micali, Silvio, 179, 598. 

Michigan, University of, 242, 617. 
Middle-square method, 3—4, 7-8, 27, 36, 75. 
Midpoint, 244. 

Mignotte, Maurice, 450, 683. 
Mihăilescu, Preda-Mihai, 396. 

Mikami, Yoshio (三上義夫), 340, 486, 648. 
Mikusiński, Jan, 378. 

Miller, Gary Lee, 395-396. 

Miller, James (= Jimmy) Milton, 108. 
Miller, Jeffrey Charles Percy, 695. 
Miller, Kenneth William, 108. 

Miller, Victor Saul, 416. 

Miller, Webb Colby, 485. 
Milne-Thompson, Louis Melville, 505. 
Minimizing a quadratic form, 98-101, 
105, 115-118. 

Minimum polynomial, 711. 

Minkowski, Hermann, 579. 

Minus zero, 202, 244-245, 249, 268, 274. 
MIP-years, 176, 405. 

Miranker, Willard Lee, 242. 

Mises, Richard, Edler von, 149, 177, 494. 
Mitchell, Gerard Joseph Francis Xavier, 
21, 32. 


MIX computer, vi, 209. 
binary version, 202-204, 339, 
389-390, 481. 
floating point attachment, 215, 
223-225, 516. 
Mixed congruential method, 11, see Linear 
congruential sequence. 
Mixed-radix number systems, 66, 199, 
208-211, 290, 293, 505. 
addition and subtraction, 209, 281. 
balanced, 103, 293, 631. 
comparison, 290. 
counting by 1s, 103. 
multiplication and division, 209. 
radix conversion, 327. 
Mixture of distribution functions, 
123-124, 138. 
Möbius, August Ferdinand, function,
354, 376, 456, 459. 
inversion formula, 456, 652. 
mod, 228, 421, 544, 734. 
mod m arithmetic, 
addition, 12, 15, 203, 287—288. 
division, 354, 445, 499; see also Inverse 
modulo m. 
halving, 293. 
multiplication, 12-16, 284, 287-288, 
294, 318, 663. 
on polynomial coefficients, 418—420. 
square root, 406—407, 415, 456—457, 
681-682. 
subtraction, 15, 186, 203, 287-288. 
Model V computer, 225. 
Modular arithmetic, 284—294, 302-305, 
450, 454, 499. 
complex, 292. 
Modular method for polynomial gcd, 
453, 460. 
Modulus in a linear congruential sequence, 
10-16, 23, 184. 
Moenck, Robert Thomas, 449, 505. 
Moews, David John, 593. 
Moivre, Abraham de, 537. 
Møller, Ole, 242. 
Monahan, John Francis, 130, 131, 135. 
Monic polynomial, 418, 420, 421, 425, 
435, 452, 457, 518. 
Monier, Louis Marcel Gino, 414, 662. 
Monkey tests, 75. 
Monomials, evaluation of, 485, 697. 
Monotonicity, 230, 243. 
Monte Carlo, 2, 29, 55, 114, 185, 189. 
Monte Carlo method: Any computational 
method that uses random numbers 
(possibly not producing a correct 
answer); see also Las Vegas algorithms, 
Randomized algorithms. 
Montgomery, Hugh Lowell, 683. 
Montgomery, Peter Lawrence, 284, 322. 
multiplication mod m, 284, 386, 396. 
Moore, Donald Philip, 27, 32. 
Moore, Louis Robert, III, 108.

Moore, Ramon Edgar, 242. 

Moore School of Electrical Engineering, 

208, 225. 

Morain, François, 390.

Morgenstern, Jacques, 524. 

Morley, Frank Vigor, 199. 

Morris, Robert, 613. 

Morrison, Michael Allan, 396, 400, 660. 

Morse, Harrison Reed, III, 192. 

Morse, Samuel Finley Breese, code, 377. 

Moses, Joel, 454-455. 

Most significant digit, 195. 

Motzkin, Theodor (= Theodore) Samuel 

(תיאודור שמואל מוצקין), 378, 490, 494,

495, 497, 518, 519, 705. 

Muddle-square method, 36, 174-176, 179. 

Muller, Mervin Edgar, 122, 143. 

Multinomial coefficients, 539. 

Multinomial theorem, 722. 

Multiple-precision arithmetic, 58, 202, 

265-318, 419, 486. 
addition, 266-267, 276-278, 281, 283. 
comparison, 281. 
division, 270-275, 278-279, 282-283, 
311-313. 
greatest common divisor, 345-348, 
354, 355, 379, 656. 
multiplication, 268-270, 283, 294-318. 
radix conversion, 326, 328. 
subtraction, 267—268, 276, 281, 283. 
Multiple-precision constants, 352, 362, 366, 
384, 659, 663, 712, 726-728. 
Multiples, 422. 
Multiples of an irrational number mod 1, 
164, 379, 622. 

Multiplication, 194, 207—208, 265, 294, 462. 
complex, 205, 307-310, 487, 506, 519, 706. 
double-precision, 249-250, 252, 295. 
fast (asymptotically), 294-318. 
floating point, 220, 230-231, 243, 263-264. 
fractions, 282, 330. 
matrix, 499-501, 506-507, 516, 

520-523, 699. 
Mersenne, 294. 
mixed-radix, 209. 
mod m, 12-16, 284, 287—288, 294, 
318, 663. 
mod u(x), 446. 
modular, 285-286, 302-305. 
multiprecision, 268-270, 283, 294-318. 
multiprecision by single-precision, 281. 
polynomial, 418-420, 508, 512, 521,
712, 713.
power series, 525. 
two’s complement, 608. 
Multiplicative congruential method, 11, 
19-23, 185-186. 
Multiplier in a linear congruential sequence, 
10-11, 16-26, 88-89, 105-109, 184-185. 
Multiply-and-add algorithm, 268, 313. 
Multiprecision: Multiple-precision or
Arbitrary precision. 
Multiprimality: Total number of prime 
factors, 384. 
Multisets, 170, 473, 483. 
operations on, 483, 694-695. 
terminological discussion, 694. 
Multivariate polynomials, 418-419, 
422, 455, 518. 
chains, 497—498, 514. 
factors, 458. 
noncommutative, 436. 
roots of, 436. 
Munro, James Ian, 515, 706. 
Musical notation, 198. 
Musinski, Jean Elisabeth Abramson, 507. 
Musser, David Rea, 278, 453, 455. 


N-source, 170. 

Nadler, Morton, 627. 

Nance, Richard Earle, 189. 

Nandi, Salil Kumar (Hert Gea aA), 278. 

NaNs, 245, 246, 639. 

Napier, John, Laird of Merchiston, 194, 200. 

Narayana Pandita, son of Nrsimha 

(aman after, seer g7:), 387. 

Native American mathematics, 196. 

Needham, Noel Joseph Terence Montgomery 

(李約瑟), 287.

Negabinary number system, 204—205, 

209-210, 212, 328. 

Negacyclic convolution, 521. 

Negadecimal number system, 204, 210. 

Negative binomial distribution, 140. 

Negative digits, 207-213, 696. 

Negative numbers, representation of, 

202-205, 275-276. 

Negative radices, 204-205, 209-210, 

212, 328. 

Neighborhood of a floating point 

number, 234. 

Neugebauer, Otto Eduard, 196, 225. 

Neumann, John von (= Margittai Neumann 

János), 1, 3-4, 26, 36, 119, 125, 128,

138, 140, 202, 226, 278, 327. 

Newcomb, Simon, 255. 

Newman, Donald Joseph, 697. 

Newton, Isaac, 449, 486, 698, 701. 

interpolation formula, 503-505, 516. 

method for rootfinding, 278-279, 312, 
486, 529, 629, 719. 

Ni, Wen-Chun (XÆ), 121. 

Nicomachus of Gerasa (Νικόμαχος ὁ ἐκ Γεράσων), 659.

Niederreiter, Harald Günther, 106-107, 109, 

113-115, 117, 161, 177, 584. 

Nijenhuis, Albert, 146. 

Nine Chapters on Arithmetic, 340. 

Nines, casting out, 289, 303, 324. 

Nines’ complement notation, 203, 210. 


Nisan, Noam (נעם ניסן), 316.
Niven, Ivan Morton, 155-156. 
Nonary (radix 9) number system, 200, 637. 
Noncommutative multiplication, 436-438, 
500, 507, 672, 684. 
Nonconstructive proofs, 286, 289, 583. 
Nonnegative: Zero or positive. 
Nonsingular matrix: A matrix with nonzero 
determinant, 98, 116, 513, 520. 
Norm of a polynomial, 457—458. 
Normal deviates: Random numbers with the 
normal distribution, 122-132, 142. 
dependent, 132, 139. 
direct generation, 141. 
square of, 134. 
Normal distribution, 56, 122, 384, 565. 
tail of, 139. 
variations, 132, 139. 
Normal evaluation schemes, 506, 709-710.
Normal numbers, 177.
Normalization of divisors, 272-273, 282-283.
Normalization of floating point numbers,
215-217, 227-228, 238, 248-249,
254, 616.
Normand, Jean-Marie, 29. 
Norton, Graham Hilton, 373, 673. 
Norton, Karl Kenneth, 383. 
Norton, Victor Thane, Jr., 607. 
Notations, index to, 730-734. 
Nowak, Martin R., 409. 
Nozaki, Akihiro (野崎昭弘), 524.
NP-complete problems, 499, 514, 585, 698. 
Null space of a matrix, 443-444, 456, 
659-660, 681. 
Number field sieve, 403, 671. 
Number fields, 331, 333, 345, 403, 674. 
Number sentences, 605. 
Number system: A language for representing 
numbers. 
balanced binary, 213. 
balanced decimal, 211. 
balanced mixed-radix, 103, 293, 631. 
balanced ternary, 207—208, 209, 
227, 283, 353. 
binary (radix 2), 195, 198-206, 209-213, 
419, 461, 483. 
combinatorial, 209. 
complex, 205-206, 209-210, 292. 
decimal (= denary, radix ten), 197—199, 
210, 320-326, 374. 
duodecimal (radix twelve), 199-200. 
factorial, 66, 209. 
Fibonacci, 209. 
floating point, 196, 214-215, 222, 228, 246. 
hexadecimal (radix sixteen), 201-202, 
204, 210, 324, 639. 
mixed-radix, 66, 199, 208-211, 290, 
293, 505. 
modular, 284—285. 
negabinary (radix —2), 204-205, 
209-210, 212, 328. 
negadecimal, 204, 210. 




nonary (radix 9), 200, 637. 

octal (= octonary = octonal, radix 8), 
194, 200-202, 210, 228, 323-325, 
328, 481, 727. 

p-adic, 213, 605, 632, 685. 

phi, 209. 

positional, 151, 166-167, 177, 195-213, 
319-329. 

primitive tribal, 195, 198. 

quater-imaginary (radix 2i), 205,
209-210, 283. 

quaternary (radix 4), 195, 200. 

quinary (radix 5), 195, 200, 213. 

rational, 330, 420. 

regular continued fraction, 346, 358-359, 
368, 374-379, 412, 415, 665. 

reversing binary, 212. 

revolving binary, 212. 

senary (radix 6), 200. 

senidenary (= hexadecimal), 202. 

septenary (radix 7), 200. 

sexagesimal (radix sixty), 196-200, 
225, 326. 

slash, 331-333, 379. 


ternary (radix 3), 195, 200, 204, 213, 328. 


vigesimal (radix twenty), 196. 
Numerical instability, 245, 292, 485, 
489, 490. 
Nunes (= Núñez Salaciense = Nonius),
Pedro, 424. 
Nussbaumer, Henri Jean, 521, 710. 
Nystrom, John William, 201. 


Octal (radix 8) number system, 194, 
200-202, 210, 228, 323-325, 328, 
481, 727. 

Octavation, 326. 

Odd-even method, 128-130, 139. 

Odlyzko, Andrew Michael, 416, 541, 
608, 667, 671. 

OFLO, 218. 

Oldham, Jeffrey David, vii. 

Oliver, Ariadne, 725. 

Oliveira e Silva, Tomás António Mendes, 
386, 667. 

Olivos Aravena, Jorge Augusto Octavio, 
485, 698. 

One-way function, 172, 179. 

Ones’ complement notation, 12, 203-204, 
275-276, 279, 288, 544. 

Online algorithms, 318, 525-526, 720. 

Operands: Quantities that are operated on, 
such as u and v in the calculation 
of u+v. 

Ophelia, daughter of Polonius, v. 

Optimum methods of computation, 
see Complexity. 

OR (bitwise or), 140, 686, 695. 

Order of a modulo m, 20-23, 391-392. 

Order of an element in a field, 457. 

Order of magnitude zero, 239. 
Order statistics, 135.
Ordered hash table, 592. 
Organ-pipe order, 378. 
Oriented binary tree, 692. 
Oriented tree, 9, 464-465, 481—482. 
Ostrowski, Alexander Markus, 494. 
Oughtred, William, 225, 326. 
Overflow, 12-13, 252, 267, 293, 332, 
543, 639. 
exponent, 217, 221, 227, 231, 241, 
243, 249. 
fraction, 217, 254, 262, 264. 
rounding, 217, 220, 222, 224, 227—228. 
Overstreet, Claude Lee, Jr., 189. 
Owen, John, 1. 
Owings, James Claggett, Jr., 178. 
Ozawa, Kazufumi (小澤一文), 615.


p-adic numbers, 213, 605, 632, 685. 

Packing, 109. 

Padé, Henri Eugène, 357, 534. 

Padegs, Andris, 226. 

Pairwise independence, 183, 668-669. 

Palindromes, 415. 

Palmer, John Franklin, 222. 

Pan, Victor Yakovlevich (Пан, Виктор
Яковлевич), 490, 492, 497, 500, 505,
507, 515, 517, 519, 521, 677, 699, 703, 
705, 706, 714, 715, 721. 

Panario Rodriguez, Daniel Nelson, 449. 

Pandu Rangan, Chandrasekaran 
(¢65oCeagosr rris), 717. 

Papadimitriou, Christos Harilaos
(Παπαδημητρίου, Χρίστος Χαριλάου), 697.

Pappus of Alexandria (Πάππος
ὁ Ἀλεξανδρινός), 225.

Paradox, 257. 

Parallel computation, 286, 317, 488, 503. 

Parameter multiplications, 518, 524. 

Parameter step, 494, 518. 

Pardo, see Trabb Pardo. 

Park, Stephen Kent, 108. 

Parlett, Beresford Neill, 194. 

Parry, William, 209. 

Partial derivatives, 524. 

Partial fraction expansion, 85, 510, 685. 

Partial ordering, 694. 

Partial quotients, 87, 106, 117, 346, 359, 
367-369, 379, 656. 

distribution of, 362-369, 665. 

Partition test, 63-64, 74, 158. 

Partitions of a set, 64, 722. 

Partitions of an integer, 79, 146. 

Pascal, Blaise, 199. 

Pascal-SC language, 242. 

Patashnik, Oren, 741. 

Paterson, Michael Stewart, 519, 634, 707. 

Patience, 190. 

Patterson, Cameron Douglas, 390. 

Paul, Nicholas John, 128. 

Pawlak, Zdzisław, 205, 627.
Payne, William Harris, 32.
Paz, Azaria (עזריה פז), 498.
Peano, Giuseppe, 201. 
Pearson, Karl, 55, 56. 
Peirce, Charles Santiago Sanders, 538. 
Pemantle, Robin Alexander, 542. 
Penk, Michael Alexander, 646. 
Penney, Walter Francis, 206. 
Pentium computer chip, 280, 409. 
Percentage points, 44, 46, 51, 70-71, 383. 
Percival, Colin Andrew, 632. 
Perfect numbers, 407. 
Perfect squares, 387—388. 
Period in a sequence, 7—9. 
length of, 4, 16-23, 37—40, 95. 
Periodic continued fraction, 375, 415. 
Permanent, 499, 515-516. 
Permutation: An ordered arrangement 
of a set. 
mapped to integers, 65-66, 77-78, 145. 
random, 145-148, 384, 460, 679. 
Permutation test, 65-66, 77—78, 80-81, 
91, 154. 
Perron, Oskar, 356, 460, 690. 
Persian mathematics, 197, 326, 462. 
Pervushin, Ivan Mikheevich (Первушин,
Иван Михеевич), 407.
Pethő, Attila, 607. 
Petkovšek, Marko, 608.
Petr, Karel, 442. 
Pfeiffer, John Edward, 192. 
Phalen, Harold Romaine, 200. 
Phi (φ), 164, 209, 283, 359, 360, 514,
652, 726-727, 733. 
Phillips, Ernest William, 201—202. 
Pi (π), 41, 151, 158, 161, 198, 200, 209,
279-280, 284, 358, 726-727, 733. 
as “random” example, 21, 25, 33, 47, 52, 
89, 103, 106, 108, 184, 238, 243, 252, 
324-325, 555, 593, 599, 665. 
Picutti, Ettore, 412. 
Pigeonhole principle, 286. 
Piṅgala, Ācārya (आचार्य पिङ्गल), 461.
Pipeline, 283. 
Pippenger, Nicholas John, 481, 697. 
Piras, Francesco, 683. 
Pitfalls of random number generation, 
6, 29, 88, 188-189. 
Pitteway, Michael Lloyd Victor, 653. 
Places, 265. 
Planck, Karl Ernst Ludwig Marx (= Max), 
constant, 214, 227, 238, 240. 
Plauger, Phillip James, 327. 
Playwriting, 190-192. 
Plouffe, Simon, 284. 
PM system, 420. 
Pocklington, Henry Cabourn, 414, 681. 
Pointer machine, 311, 317, 634. 
Poirot, Hercule, 725. 
Poisson, Siméon Denis, distribution, 55, 
137-138, 140, 141, 538, 570. 


Poker test, 63-64, 74, 158. 
Polar coordinates, 56, 59, 123. 
Polar method, 122-123, 125, 135. 
Pollard, John Michael, 306, 385-386, 
402-403, 413, 417, 658, 711. 
Pólya, György (= George), 65, 569. 
Polynomial, 418—420, 486. 
addition, 418—420. 
arithmetic modulo m, 37—40, 
419-420, 464. 
degree of, 418, 420, 436. 
derivative of, 439, 489, 524, 537. 
discriminant of, 674, 686. 
distribution function, 138. 
division, 420—439, 487, 534. 
evaluation, 485-524. 
factorization, 439-461, 514. 
greatest common divisor, 424-439,
453-455, 460.
interpolation, 297, 365, 503-505, 509, 
516, 700, 721. 
irreducible, 422, 435, 450, 456—457, 460. 
leading coefficient, 418, 451—452, 454. 
monic, 418, 420, 421, 425, 435, 452, 
457, 518. 
multiplication, 418-420, 508, 512, 
521, 712, 713. 
multivariate, 418-419, 422, 455, 518. 
norms, 457—458. 
over a field, 420—425, 435, 439-449, 
455-459. 
over a unique factorization domain, 
421—439, 449-461. 
primitive, 422, 436. 
primitive modulo p, 30-32, 422. 
primitive part, 423-425. 
random, 435, 448, 455, 459. 
remainder sequence, 427—429, 438, 
455, 721. 
resultant, 433, 674, 690. 
reverse of, 435, 452, 673, 721. 
roots of, 23, 434, 436, 483, 493. 
sparse, 455, 672. 
squarefree, 439, 456, 459. 
string, 436-438. 
subtraction, 418—420. 
Polynomial chains, 494—498, 517-524. 
Pomerance, Carl, 396, 402, 659. 
Poorten, Alfred Jacobus van der, 656. 
Pope, Alexander, 88. 
Pope, David Alexander, 278. 
Popper, Karl Raimund, 178. 
Portable random number generators, 
185-188, 193. 
Porter, John William, 372. 
Positional representation of numbers, 151, 
166-167, 177, 195-213, 319-329. 
Positive definite quadratic form, 98, 115. 
Positive operator, 365. 
Positive semidefinite matrix, 586. 


Potency, 24-26, 36, 47, 52, 73, 83, 87-88, 
92, 105, 184. 
Power matrix, 534-536. 
Power series: A sum of the form
$\sum_{k\ge 0} a_k z^k$, see Generating functions.
manipulation of, 525-537. 
Power tree, 464, 481. 
Poweroids, 534-536, 722. 
Powers, Donald (= Don) Michael, 312. 
Powers, evaluation of, 461—485. 
multiprecision, 463. 
polynomial, 463—464. 
power series, 526, 537, 719. 
Powers, Ralph Ernest, 396, 407. 
pp: Primitive part, 423—425. 
Pr: Probability, 150, 152, 168, 179-180, 
257, 264, 472, 734. 
Pratt, Vaughan Ronald, 356, 413. 
Precision: The number of digits in a 
representation. 
double, 246-253, 278-279, 295. 
multiple, 58, 202, 265-318, 419, 486. 
quadruple, 253, 295. 
single: fitting in one computer word, 215. 
unlimited, 279, 283, 331, 416, see also 
Multiple-precision. 
Preconditioning, see Adaptation. 
Prediction tests, 171, 183. 
Preston, Richard McCann, 280. 
Primality testing, 380, 391-396, 
409-414, 549. 
Prime chains, 415, 666. 
Prime numbers: Integers greater than unity 
having no proper divisors, 380. 
distribution of, 381-382, 405. 
enumeration of, 381-382, 416. 
factorization into, 334. 
largest known, 407—412. 
Mersenne, 185, 409, 412, 413. 
size of mth, 665. 
useful, 291, 405, 407—408, 549-550, 711. 
useless, 415. 
verifying primality of, 380, 391-396, 
409-414, 549. 
Primes in a unique factorization domain, 
421-422. 
Primitive element modulo m, 20-23. 
Primitive notations for numbers, 195, 198. 
Primitive part of a polynomial, 423—425. 
Primitive polynomial, 422, 436. 
Primitive polynomial modulo p, 30-32, 422. 
Primitive recursive function, 166. 
Primitive root: A primitive element 
in a finite field, 20-23, 185, 391, 
417, 456, 457. 
Priority sampling, 148. 
Pritchard, Paul Andrew, 631. 
Probabilistic algorithms, see Randomized 
algorithms. 
Probability: Ratio of occurrence, 150,
177, 257.
abuse of, 433. 
over the integers, 150, 152, 257, 264, 472. 

Probert, Robert Lorne, 699. 

Programming languages, 16, 185, 222. 

Pronouncing hexadecimal numbers, 201. 

Proof of algorithms, 281-282, 336-337, 592. 

Proofs, constructive versus nonconstructive, 
286, 289, 583, 630. 

Proper factor of v: A factor that is neither 
a unit nor a unit multiple of v. 

Proth, François Toussaint, 663. 

Proulx, René, 179. 

Pseudo-division of polynomials, 425—426, 
435-436. 

Pseudorandom sequences, 4, 170-176, 179. 

Ptolemy, Claudius (Πτολεμαῖος Κλαύδιος),
197. 

Public key cryptography, 406. 

Purdom, Paul Walton, Jr., 541. 

Pyke, Ronald, 566. 


q-series, 536. 

Quadratic congruences, solving, 417. 

Quadratic congruential sequences, 26-27, 37. 

Quadratic forms, 98, 521. 

minimizing, over the integers, 98-101, 
105, 115-118. 

Quadratic irrationalities, continued 
fractions for, 358, 374-375, 397—401, 
412, 415, 665. 

Quadratic reciprocity law, 393, 411, 

414, 663. 

Quadratic residues, 415, 697. 

Quadratic sieve method, 402. 

Quadruple-precision arithmetic, 253, 295. 

Quandalle, Philippe, 710. 

Quasirandom numbers, 4, 189. 

Quater-imaginary number system, 205, 
209-210, 283. 

Quaternary number system, 195, 200. 

Quick, Jonathan Horatio, 77, 147. 

Quinary number system, 195, 200, 213. 

Quolynomial chains, 498, 524, 704-705. 

Quotient: ⌊u/v⌋, 265, see Division.

of polynomials, 420-421, 425—426, 534. 

partial, 87, 106, 117, 346, 359, 362-369, 
379, 656, 665. 

trial, 270-272, 278, 282. 


Rabin, Michael Oser (מיכאל עוזר רבין), 175,
396, 406, 413, 415, 448, 449, 707. 
Rabinowitz, Philip, 279. 
Rackoff, Charles Weill, 179. 
Rademacher, Hans, 90, 91. 
Radioactive decay, 7, 132, 137. 
Radix: Base of positional notation, 195. 
complex, 205-206, 209-210. 
irrational, 209. 
mixed, 66, 199, 208-211, 290, 293, 505. 
negative, 204-205, 209-210, 212, 328. 
Radix conversion, 200, 204, 205, 207,

210, 319-329, 486, 489. 
floating point, 326-329. 
multiprecision, 326, 328. 

Radix point, 10, 185, 195, 204, 209, 214, 319. 

Raimi, Ralph Alexis, 257, 262. 

Raleigh (= Ralegh), Walter, 199. 

Rall, Louis Baker, 240, 242. 

Ramage, John Gerow, 135. 

Ramanujan Iyengar, Srinivasa (uthesfeuren 
TITLO ITED Nor SQusmIK TT), 662. 

Ramaswami, Vammi (bið giripe i), 383. 

Ramshaw, Lyle Harold, 164, 181. 

ran_array, 186-188, 193. 

RAND Corporation, 3. 

Randell, Brian, 202, 225. 

Random bits, 12, 30-32, 35-36, 38, 48, 
119-120, 170-176. 

Random combinations, 142-148. 

Random directions, 135-136. 

Random fractions, 10, see Uniform deviates. 

Random functions, 4-9, 385. 

Random integers, 

among all positive integers, 257, 
264, 342, 354. 
in a bounded set, 2-3, 6-7, 119-121, 
138, 162-163, 185-186. 
Random mappings, 4-9, 385, 657—658. 
Random number generators, 1-193. 
for nonuniform deviates, 119-148. 
for uniform deviates, 10—40, 184-189, 193. 
machines, 3, 404. 
summary, 184-193. 
tables, 3. 
testing, 41-118. 
using, 1-2, 119-148, 664, see also 
Randomized algorithms. 
Random permutations, 145-148, 384, 
460, 679. 
of a random combination, 148. 
Random point, in a circle, 123. 
in a sphere, 136. 
on an ellipsoid, 141. 
on a sphere, 135. 

Random polynomials, 435, 448, 455, 459. 

Random random number generators, 
6-9, 26. 

Random real numbers, 255, 263. 

Random samples, 142-148. 

Random sequences, meaning of, 2, 149-183. 

finite, 167-176, 178-179, 183. 

Random walk test, 34, 79. 

Randomized algorithms: Algorithms that 
use random numbers and usually 
produce a correct answer, 1—2, 171, 
395-396, 401—402, 413-417, 436, 
447-449, 453, 459, 669, 688. 

Randomness, guaranteed, 35-36, 170-176. 

RANDU, 26, 107, 188, 551. 

Rangan, see Pandu Rangan. 


Range arithmetic, 228, 240—242, 244-245, 
333, 613. 

Rank, of apparition, 410—411. 

of a matrix, 443-444, 506, 508, 520. 
of a tensor, 506, 508, 514, 520-524. 

RANLUX, 109. 

Rap music, 3. 

Rapoport, Anatol, 541. 

Ratio method, 130-132, 133, 140. 

Rational arithmetic, 69, 330-333, 
427-428, 526. 

Rational function approximation, 
438-439, 534. 

Rational functions, 420, 498, 518. 

approximation and interpolation, 
438-439, 505, 534. 

Rational numbers, 330, 420, 459. 
approximation by, 331-332, 378-379, 617. 
mod m, 379. 
polynomials over, 428, 459. 
positional representation of, 16, 

211, 213, 328. 

Rational reconstruction, 379. 

Real numbers, 420. 

Real time, 286. 

Realization of a tensor, 507. 

Reciprocal differences, 505. 

Reciprocals, 278-279, 312-313, 421. 
floating point, 243, 245, 263. 
mod 2^e, 213, 629.
mod m, 26, 213, 354, 445, 456, 646. 
power series, 537. 

Reciprocity laws, 84, 90, 393, 414. 

Recorde, Robert, xi, 280-281. 

Rectangle-wedge-tail method, 123-128, 139. 

Rectangular distribution, see Uniform 
distribution. 

Recurrence relations, 10, 26-33, 37—40, 
260-261, 295, 301-302, 313, 318, 

351, 362, 386, 409-411, 442, 525, 
634, 687, 695, 714. 

Recursive processes, 253, 295, 299-303, 
318, 419, 500, 531, 689, 713. 

Reeds, James Alexander, III, 599. 

Rees, David, 39, 169. 

Registers, 491. 

Regular continued fractions, 346, 358-359, 
368, 374-379, 412, 415, 665. 

Rehkopf, Donald Caspar, 41. 

Reiser, John Fredrick, 28, 39, 242. 

Reitwiesner, George Walter, 213, 280. 

Rejection method, 125-126, 128-129, 

134, 138, 139, 591. 

Relative error, 222, 229, 232, 253, 255. 

Relatively prime: Having no common prime 
factors, 11, 19, 286, 330, 332, 342, 354. 

polynomials, 422, 436, 454. 

Remainder: Dividend minus quotient times 
divisor, 265, 272-273, 420—421, 425-426, 
437, 534, see also mod. 

Replicative law, 90. 


Representation of numbers, see Number 
systems. 

Representation of trees, 482. 

Representation of ∞, 225, 244-245, 332.

Reservoir sampling, 143-144, 147, 148. 

Residue arithmetic, 284—294, 302-305, 
450, 454, 499. 

Result set, 494, 517. 

Resultant of polynomials, 433, 674, 690. 

Revah, Ludmila, 706. 

Reverse of a polynomial, 435, 452, 673, 721. 

Reversing binary number system, 212. 


Reversion of power series, 527-530, 533-536. 


Revolving binary number system, 212. 

Rezucha, Ivan, 143. 

Rhind papyrus, 462. 

Rho method for factoring, 384-386, 
393-394, 413. 

Riccati, Jacopo Francesco, equation, 650. 

Rieger, Georg Johann, 653. 

Riemann, Georg Friedrich Bernhard, 
83, 382, 414. 

hypothesis, 382, 663. 
hypothesis, generalized, 395-396, 671.
integration, 153-154, 259. 

Riffle shuffles, 145, 147. 

Right divisor, greatest common, 437—438. 

Ring with identity, commutative, 418. 

Riordan, John, 542. 

Rising powers, 534, 731. 

Ritzmann, Peter, 721. 

Rivat, Joël, 667.

Rivest, Ronald Linn, 403, 405, 707. 

Robber, 190-192. 

Robbins, David Peter, 593. 

Robinson, Donald Wilford, 554. 

Robinson, Julia Bowman, 666. 

Robinson, Raphael Mitchel, 664, 711. 

Roepstorff, Gert, 366. 

Rolletschek, Heinrich Franz, 9, 345. 

Roman numerals, 195, 209. 

Romani, Francesco, 500, 715. 

Roof, Raymond Bradley, 115. 

Roots of a polynomial, 23, 434, 483, 493. 

multivariate, 436. 

Roots of unity, 84, 531-532, 700; see 
also Cyclotomic polynomials, 
Exponential sums. 

Rosińska, Izabela Grażyna, 198. 

Ross, Douglas Taylor, 192. 

Rotenberg, Aubey, 11, 47. 

Rothe, Heinrich August, 535. 

Rouché, Eugène, theorem, 690. 

Roulette, 2, 10, 55. 

Round to even, 237, 241, 245. 

Round to odd, 237. 

Rounding, 102, 207, 217, 222, 223, 
230-231, 236-237. 

mediant, 331-332, 379. 
Rounding errors, 232, 242, 698, 718. 
Rounding overflow, 217, 220, 222,
224, 227-228. 

Rozier, Charles Preston, 324. 

RSA box, 404, 406. 

RSA encryption, 403-407, 415, 629, 669. 

Rudolff, Christof, 198. 

Ruler function ρ(n), 540.

Run test, 63, 66-69, 74-77, 158, 180. 

Runs above (or below) the mean, 63. 

Runs in a permutation, 66, 74, 76. 

Russian peasant method, 462. 

Ruzsa, Imre Zoltán, 213.

Ryser, Herbert John, 515, 699. 


$n, 170. 

Saarinen, Jukka Pentti Päiviö, 75. 

Sachau, Karl Eduard, 461. 

Saddle point method, 568. 

Sahni, Sartaj Kumar, 60. 

Saidan, Ahmad Salim (أحمد سليم سعيدان),
198, 461. 

Salamin, Eugene, 283. 

Salfi, Robert, 145. 

Samelson, Klaus, 241—242, 327. 

Samet, Paul Alexander, 321. 

Sampling (without replacement), 1, 142-148. 

Sands, Arthur David, 610. 

Saunders, Benjamin David, 455. 

Savage, John Edmund, 707. 

Sawtooth function ((x)), 81-82, 90-91. 

Saxe, James Benjamin, 141. 

Saxena, Nitin (नितिन सक्सेना), 396.

Scarborough, James Blaine, 241. 

Schatte, Peter, 262, 622. 

Schelling, Hermann von, 65. 

Schmid, Larry Philip, 73. 

Schmidt, Erhard, 101, 674. 

Schmidt, Wolfgang M, 183. 

Schnorr, Claus-Peter, 118, 179, 414, 417, 
497, 578, 664, 669. 

Scholz, Arnold, 478. 

Scholz—Brauer conjecture, 478—479, 485. 

Schönemann, Theodor, 457, 685. 

Schönhage, Arnold, 292, 302-303, 305, 306,
311, 317, 328, 470, 484, 500, 522, 629, 
638, 656, 672, 696, 715. 

Schönhage-Strassen algorithm, 306-311, 317.

Schooling, William, 627. 

Schreyer, Helmut, 202. 

Schröder, Friedrich Wilhelm Karl Ernst, 531.

function, 531-532, 724. 

Schroeppel, Richard Crabtree, 399, 400, 671. 

Schubert, Friedrich Theodor von, 450. 

Schwartz, Jacob Theodore, 674, 675. 

Schwarz (= Švarc), Štefan, 442.

Schwenter, Daniel, 654. 

Secrest, Don, 279, 327. 

Secret keys, 193, 403-407, 415, 417, 505. 

Secure communications, 2, 403—407, 415. 

Sedgewick, Robert, 540. 
Seed (starting value), 143, 146, 170,
187—188, 193, 550, 590. 
in a linear congruential sequence, 
10, 17, 20, 184. 
Seidenberg, Abraham, 198. 
Selection sampling, 142-143, 146. 
Selenius, Clas-Olof, 648. 
Self-reproducing numbers, 6, 293-294, 540. 
Selfridge, John Lewis, 394, 396, 412, 665. 
Semi-online algorithm, 529. 
Semigroup, 539. 
Seneschal, David, 589. 
Septenary (radix 7) number system, 200. 
Serial correlation coefficient, 77. 
Serial correlation test, 72-74, 83, 91,
154, 182.
Serial test, 39, 60, 62, 74, 75, 78, 95, 
106, 109-115, 158. 
Seroussi Blusztein, Gadiel (גדיאל סרוסי), 712.
Serret, Joseph Alfred, 374, 449, 579. 
Sethi, Ravi, 485. 
SETUN computer, 208. 
Sexagesimal number system, 196-200, 
225, 326. 
Seysen, Martin, 118. 
Shafer, Michael William, 409. 
Shakespeare (= Shakspere), William, v. 
Shallit, Jeffrey Outlaw, 360, 378, 380, 390, 
395-396, 645, 646, 656, 663, 689. 
Shamir, Adi (עדי שמיר), 403, 405, 416,
505, 599, 669. 
Shand, Mark Alexander, 629. 
Shanks, Daniel Charles, 280, 379, 681-682. 
Shanks, William, 279-280. 
Shannon, Claude Elwood, Jr., 211. 
Shaw, Mary Margaret, 489, 498, 515. 
Shen, Kangshen (沈康身), 287.
Sheriff, 190-192. 
Shibata, Akihiko (柴田昭彦), 280.
Shift operators of MIX, 339. 
Shift register recurrences, 27—32, 36—40, 
186-188, 193. 
Shift-symmetric N-source, 172, 183. 
Shirley, John William, 199. 
Shokrollahi, Mohammad Amin 
(محمد امین شکراللهی), 515.
Short vectors, 98-101, 118. 
Shoup, Victor John, 449, 687. 
Shub, Michael Ira, 36. 
Shuffled digits, 141. 
Shuffling a sequence, 33-36, 38, 39. 
Shuffling cards, 145-147. 
Shukla, Kripa Shankar (कृपा शंकर
शुक्ल), 208, 648.
Sibuya, Masaaki (渋谷政昭), 133.
SICOMP: SIAM Journal on Computing, 
published by the Society for Industrial 
and Applied Mathematics since 1972. 
Sideways addition, 463, 466. 
Sierpiński, Wacław Franciszek, 666.
Sieve methods, 389-391, 402—403, 412. 


Sieve (κόσκινον) of Eratosthenes, 412,
416, 667. 

Sieveking, Malte, 720. 

Signatures, digital, 406. 

Signed magnitude representation, 202-203, 
209-210, 247, 266. 

Significant digits, 195, 229, 238. 

Sikdar, Kripasindhu (কৃপাসিন্ধু শিকদার), 327.

Silverman, Joseph Hillel, 402. 

Simplex, recursively subdivided, 567. 

Simulation, 1. 

Sinclair, Alistair, 699. 

Sine, 490. 

Singh, Avadhesh Narayan (अवधेश नारायण
सिंह), 343, 461.

Singh, Parmanand (परमानन्द सिंह), 387.

Sink vertex, 480. 

SKRZAT 1 computer, 205. 

Slash arithmetic, 331-333, 379. 

SLB (shift left rAX binary), 339, 340. 

Slide rule, 225. 

Sloane, Neil James Alexander, 109. 

Slowinski, David Allen, 409. 

Small step, 467. 

Smirnov, Nikolai Vasilievich (Смирнов,
Николай Васильевич), 57.

Smith, David Eugene, 197, 198. 

Smith, David Michael, 275, 279. 

Smith, Edson McIntyre, 409. 

Smith, Henry John Stephen, 646. 

Smith, James Everett Keith, 27. 

Smith, Robert LeRoy, 228. 

Sobol, Ilya Meerovich (Соболь, Илья
Меерович), 541.

SODA: Proceedings of the ACM-SIAM 
Symposia on Discrete Algorithms, 
inaugurated in 1990. 

Soden, Walter, 323. 

Solitaire, 190. 

Solomonoff, Ray Joseph, 178. 

Solovay, Robert Martin, 396, 414. 

Sorenson, Jonathan Paul, 646. 

Sorted uniform deviates, 57, 71, 135, 
137, 141. 

Source vertex, 480. 

Sowey, Eric Richard, 189. 

Space-filling curves, 495. 

Spacings, 71, 78-79, 181. 

Sparse polynomials, 455, 672. 

Specht, Wilhelm, 683. 

Species of measure zero, 179. 

Spectral test, 30, 35, 93-118, 169, 184. 

algorithm for, 101—104. 
examples, 105-109. 
generalized, 108, 117. 

Spence, Gordon McDonald, 409. 

Spencer Brown, David John, 695. 

Sphere, n-dimensional, 56. 

random point in, 136. 
random point on, 135. 
volume of, 105. 


Spherical coordinates, 59. 
SQRT box, 175, 406—407, 415. 
Square root, 122, 213, 283, 374-375, 
397-398, 483. 
modulo m, 406-407, 415. 
modulo p, 456—457, 681-682. 
of power series, 526, 537. 
of uniform deviate, 122. 
Squarefree factorization, 460. 
Squarefree polynomials, 439, 456, 459. 
Squares, sum of two, 579-580. 
Squeamish ossifrage, 417. 
Squeeze method, 125-126, 147. 
SRB (shift right rAX binary), 339, 340, 481. 
Stability of polynomial evaluation, 
485, 489, 490. 
Stack: Linear list with last-in-first-out 
growth pattern, 299-301. 
Stahnke, Wayne Lee, 31. 


Standard deviation, evaluation of, 232, 244. 


Stanley, Richard Peter, 594. 

Star chains, 467, 473-477, 480, 482. 

Star step, 467. 

Stark, Richard Harlan, 226. 

Starting value in a linear congruential 
sequence, 10, 17, 20, 184. 

Statistical tests, 171, see Testing. 

Steele, Guy Lewis, Jr., 635-636, 638. 

Stefanescu, Doru, 450. 

Steffensen, Johan Frederik, 722. 

Stegun, Irene Anne, 44. 

Stein, Josef, 338. 

Stein, Marvin Leonard, 278. 

Stern, Moritz Abraham, 654. 

Stern—Brocot tree, 378, 656. 

Stevin, Simon, 198, 424. 

Stibitz, George Roberto, 202, 225. 

Stillingfleet, Edward, 537. 

Stirling, James, 537. 

approximation, 59. 
numbers, 64—65, 298, 534-535, 542, 
680, 732. 

STOC: Proceedings of the ACM 
Symposia on Theory of Computing, 
inaugurated in 1969. 

Stockmeyer, Larry Joseph, 519, 634, 707. 

Stoneham, Richard George, 115. 

Stoppard, Tom (= Straussler, Tomas), 61. 

Storage modification machines, 311. 

Strachey, Christopher, 192. 

Straight-line program, 494. 

Strassen, Volker, 306, 311, 317, 396, 414, 
497, 500, 507, 521, 523, 656, 708, 718. 

Straus, Ernst Gabor, 378, 485. 

Strindmo, Odd Magnar, 409. 

String polynomials, 436—438. 

Stringent tests, 75. 

Stroud, Arthur Howard, 279, 327. 

Struve, Wassilij Wassiliewitsch (Струве,
Василий Васильевич), 462.
Student (= William Sealy Gosset),
t-distribution, 135. 
Sturm, Jacques Charles François, 434, 
438, 674. 
Subbarao, Mathukumalli Venkata, 469. 
Subexponential (nice) functions, 694. 
Subnormal floating point numbers, 246. 
Subresultant algorithm, 428-436, 438, 455. 
Subsequence rules, 161-162, 168-169, 
177-178, 182. 
Subsequence tests, 73, 158. 
Subsequences, 40, 193. 
Subset FORTRAN language, 600. 
Subtract-and-shift cycle, 338, 348. 
Subtract-with-borrow sequence, 23, 35, 
72, 75, 108, 193, 546. 
Subtraction, 194, 207, 213, 265, 
267-268, 281. 
complex, 487. 
continued fractions, 649. 
double-precision, 247-249. 
floating point, 216, 230-231, 235-238, 
245, 253, 556, 602. 
fractions, 330-331. 
mod m, 15, 186, 203, 287-288. 
modular, 285-286. 
multiprecision, 267-268, 276, 281, 283. 
polynomial, 418-420. 
power series, 525. 
Subtractive random number generator, 
39-40, 186-188, 193. 
Sugunamma, Mantri, 469. 
Sukhatme, Pandurang Vasudeo, 568. 
Sum of periodic sequences, mod m, 
35, 38, 78, 108. 
Summation by parts, 643. 
Sun Tsŭ (= Sūnzǐ, Master Sun) (孫子), 
280, 287. 
Sun SPARCstation, 764. 
Suokonautio, Vilho, 279. 
Svoboda, Antonin, 282, 292. 
Swarztrauber, Paul Noble, 634. 
Swedenborg, Emanuel, 200. 
Sweeney, Dura Warren, 253, 379. 
Swinnerton-Dyer, Henry Peter Francis, 681. 
Sykora, Ondrej, 700. 
Sylvester, James Joseph, matrix, 433, 
436, 674. 
Szabó, József, 607. 
Szabó, Nicholas Sigismund, 291, 292. 
Szekeres, György (= George), 570. 
Szymanski, Thomas Gregory, 540. 


t-ary trees, 723. 

Tabari, Mohammed ben Ayyūb 

Tables of fundamental constants, 
358-359, 726-729. 

Tabulating polynomial values, 488, 515. 

Tague, Berkley Arnold, 419. 
Tail of a floating point number, 235. 
Tail of the binomial distribution, 167. 
Tail of the normal distribution, 139. 
Takahashi, Daisuke (高橋大介), 280. 
Takahasi, Hidetosi (高橋秀俊), 291. 
Tamura, Yoshiaki (田村良明), 280. 
Tanaka, Richard Isamu (田中リチャード勇), 
292. 
Tangent, 376. 
tanh, 375. 
Tannery, Jules, 241. 
Taranto, Donald Howard, 327, 635. 
Tarski (Tajtelbaum), Alfred, 718. 
Tate, John Torrence, Jr., 402. 
Tate, Stephen Ralph, 309. 
Taussky Todd, Olga, 35, 106. 
Tausworthe, Robert Clem, 31. 
Taylor, Alfred Bower, 201. 
Taylor, Brook, theorem, 489. 
Taylor, William Johnson, 504. 
Television script, 190-192. 
Ten’s complement notation, 203, 210. 
Tensors, 506-514, 520-524. 
Term: A quantity being added. 
Terminating fractions, 328. 
Ternary number system, 195, 200, 
204, 213, 328. 
balanced, 207-208, 209, 227, 283, 353. 
Testing for randomness, 41-118. 
a priori tests, 80. 
chi-square test, 42-47, 53-56, 58-60. 
collision test, 70-71, 74, 158. 
coupon collector’s test, 63-65, 74, 
76, 158, 180. 
empirical tests, 41, 61-80. 
equidistribution test, 61, 74, 75. 
frequency test, 61, 74, 75. 
gap test, 62-63, 74-76, 136, 158, 180. 
Kolmogorov-Smirnov test, 48-60. 
maximum-of-t test, 52, 54, 59, 70, 75, 
77, 122, 158, 180. 
partition test, 63-64, 74, 158. 
permutation test, 65-66, 77-78, 
80-81, 91, 154. 
run test, 63, 66-69, 74-77, 158, 180. 
serial correlation test, 72-74, 91, 
83, 154, 182. 
serial test, 39, 60, 62, 74, 75, 78, 95, 
106, 109-115, 158. 
spectral test, 30, 35, 93-118, 169, 184. 
subsequence tests, 73, 158. 
theoretical tests, 41-42, 80-93. 
torture test, 79. 
TeX, iv, vi, 764. 
Tezuka, Shu (手塚集), 164, 189, 546, 584. 
Thacher, Henry Clarke, Jr., 529. 
Theoretical tests for randomness, 
41-42, 80-93. 
Thiele, Thorvald Nicolai, 505. 
Thompson, John Eric Sidney, 196. 
Thomson, William Ettrick, 3, 11, 22. 


Thorup, Mikkel, 593. 
Thurber, Edward Gerrish, 466, 470, 
477, 478. 
Tichy, Robert Franz, 161. 
Tienari, Martti Johannes, 279. 
Tingey, Fred Hollis, 57. 
Tippett, Leonard Henry Caleb, 3. 
Tiwari, Prasoon, 316. 
Tobey, Robert George, 677. 
Tocher, Keith Douglas, 588. 
Todd, John, 35. 
Todd, Olga Taussky, 35, 106. 
Toeplitz, Otto, system, 721. 
Tonal System, 201. 
Tonelli, Alberto, 682. 
Toolkit philosophy, 487. 
Toom, Andrei Leonovich (Тоом, Андрей 
Леонович), 296, 299, 306. 
Toom-Cook algorithm, 299-302, 
316-317, 672. 
Topological sorting, 480. 
Topuzoglu, Alev, 558. 
Torelli, Gabriele, 535. 
Torres y Quevedo, Leonardo de, 225. 
Torture test, 79. 
Totient function φ(n), 19-20, 289, 369, 
376, 583, 646. 
Touchy-feely mathematics, 466, 477. 
Trabb Pardo, Luis Isidoro, 661. 
Trace of a field element, 687. 
Trager, Barry Marshall, 455, 689. 
Trailing digit, 195. 
Transcendental numbers, 378. 
Transitive permutation groups, 679. 
Transpose of a tensor, 507, 512-513. 
Transpositions, 147. 
Traub, Joseph Frederick, 138, 348, 428, 489, 
498, 505, 515, 531-534, 719. 
Trees: Branching information structures, 
413. 
binary, 378, 527, 696, 723. 
complete binary, 667. 
enumeration of, 527, 696, 723. 
oriented, 9, 464-465, 481-482. 
t-ary, 723. 
Trevisan, Vilmar, 452, 461. 
Trial quotients, 270-272, 278, 282. 
Triangularization of matrices, 444, 
659-660, 677. 
Tries, 687. 
Trigonometric functions, 279, 313, 490. 
Trilinear representation of tensors, 521-522. 
Trinomials, 32, 40, 572. 
Triple-precision floating point, 252. 
Trits, 207. 
Truncation: Suppression of trailing digits, 
207, 237-238, 309. 
Tsang, Wai Wan (曾衛寰), 72. 
Tsu Ch’ung-Chih (= Zǔ Chōngzhī) 
(祖冲之), 198. 
Tsuji, Masatsugu (辻正次), 264. 


Tukey, John Wilder, 701. 
Turán, Pál (= Paul), 372, 649. 
Turing, Alan Mathison, 3, 599. 
machines, 169, 499, 634. 
Twindragon fractal, 206, 210, 606. 
Two squares, sum of, 579-580. 
Two’s complement notation, 15, 188, 
203-204, 228, 275-276, 608. 
Twos’ complement notation, 204. 
Tydeman, Frederick John, 638. 


Ulam, Stanislaw Marcin, 138, 140, 189. 

Ullman, Jeffrey David, 694. 

Ullrich, Christian, 242. 

Ulp, 232-233. 

Underflow, exponent, 217, 221-222, 
227, 231, 241, 249. 

gradual, 222. 

Ungar, Peter, 706. 

Uniform deviates: Random numbers with 
the uniform distribution, 138. 

generating, 10-40, 184-189, 193. 

logarithm of, 133. 

sorted, 57, 71, 135, 137, 141. 

square root of, 122. 

Uniform distribution, 2, 10, 48, 61, 119, 
121, 124, 263. 

Unimodular matrix, 524. 




Units in a unique factorization domain, 
421-422, 435. 
Unity: The number one, 336. 
roots of, 84, 531-532, 700; see 
also Cyclotomic polynomials, 
Exponential sums. 
Unlimited precision, 279, 283, 331, 416, 
see also Multiple-precision. 
Unnormalized floating point arithmetic, 
238-240, 244, 327. 


Unusual correspondence, 9. 

Useful primes, 291, 405, 407-408, 
549-550, 711. 

Uspensky, James Victor, 278. 


Vahlen, Karl Theodor, 653. 

Valach, Miroslav, 292. 

Valiant, Leslie Gabriel, 499. 

Vallée, Brigitte, 352, 355, 366, 644, 645. 

Vallée Poussin, Charles Jean Gustave 
Nicolas de la, 381. 

Valtat, Raymond, 202. 

van Ceulen, Ludolph, 198. 

van de Wiele, Jean-Paul, 497, 707. 

van der Corput, Johannes Gualtherus, 
163-164, 181. 

van der Poorten, Alfred Jacobus, 656. 

van der Waerden, Bartel Leendert, 196, 
433, 518, 690. 

van Halewyn, Christopher Neil, 403. 

van Leeuwen, Jan, 477, 515, 706. 

Van Loan, Charles Francis, 562, 701. 


Unique factorization domain, 421-424, 436. 
van Wijngaarden, Adriaan, 242. 

Vari, Thomas Michael, 717. 

Variables, 418, 486. 

Variance, unbiased estimate of, 232. 

Variance-ratio distribution, 135. 

Vassilevska Williams, Virginia Panayotova 
(Василевска, Виргиния Панайотова), 
717. 

Vattulainen, Ilpo Tapio, 75, 570. 

Vaughan, Robert Charles, 451. 

Velthuis, Frans Jozef, 764. 

Veltkamp, Gerhard Willem, 616. 

Vertex cover, 485. 

Vetter, Herbert Dieter Ekkehart, 629, 656. 

Viète, François, 198. 

Ville, Jean André, 597. 

Vitányi, Pál Mihály (= Paul Michael) 
Béla, 179. 

Vitter, Jeffrey Scott, 121, 146. 

Vogel, Otto Hermann Kurt, 341. 

Voltaire, de (= Arouet, François 
Marie), 200. 

Volume of sphere, 105. 

von Fritz, Kurt, 335. 

von Mangoldt, Hans Carl Friedrich, 663. 

function, 371, 376. 

von Mises, Richard, Edler, 149, 177, 494. 

von Neumann, John (= Margittai Neumann 
János), 1, 3-4, 26, 36, 119, 125, 128, 
138, 140, 202, 226, 278, 327. 

von Schelling, Hermann, 65. 

von Schubert, Friedrich Theodor, 450. 

von zur Gathen, Joachim Paul Rudolf, 
449, 611, 673, 687. 

Vowels, Robin Anthony, 637. 

Vuillemin, Jean Etienne, 629, 649. 


Wadel, Louis Burnett, 205. 

Wadey, Walter Geoffrey, 226, 242. 

Waerden, Bartel Leendert van der, 196, 
433, 518, 690. 

Waiting time, 119, 136. 

Wakulicz, Andrzej, 205, 627. 

Wald, Abraham (= Ábrahám), 163, 
177-178. 

sequence, 164-165. 

Wales, Francis Herbert, 194, 202. 

Walfisz, Arnold, 382. 

Walker, Alastair John, 120, 127, 139. 

Wall, Donald Dines, 553. 

Wall, Hubert Stanley, 356. 

Wallace, Christopher Stewart, 132, 

141, 316, 590. 

Wallis, John, 199, 655. 

Walsh, Joseph Leonard, 502. 

Wang, Paul Shyh-Horng (王士弘), 452, 
455, 460-461, 657, 689. 

Ward, Morgan, 554. 

Waring, Edward, 503. 

Warlimont, Richard Clemens, 686. 

Watanabe, Masatoshi, 764. 
Waterman, Alan Gaisford, 40, 106-107, 
116, 144, 554, 596. 

Wattel, Evert, 466. 

Weather, 74. 

Wedge-shaped distributions, 125-126. 

Weigel, Erhard, 199. 

Weighing problem, 208. 

Weights and measures, 198-199, 201, 
209, 255, 326, 327. 

Weinberger, Peter Jay, 415, 678. 

Welch, Peter Dunbar, 701. 

Welford, Barry Payne, 232. 

Weyl, Claus Hugo Hermann, 181, 

379, 382, 596. 

Wheeler, David John, 226. 

White, Jon L (= Lyle), 635-636, 638. 

White sequence, 182. 

Whiteside, Derek Thomas, 486, 701. 

Whitworth, William Allen, 566, 568. 

Wichmann, Brian Anderson, 544. 

Wiedijk, Frederik, 665. 

Wiele, Jean-Paul van de, 497, 707. 

Wijngaarden, Adriaan van, 242. 

Wilf, Herbert Saul, 146. 

Wilkes, Maurice Vincent, 201, 226. 

Wilkinson, James Hardy, 241, 499. 

Williams, Hugh Cowie, 380, 390, 394, 

401, 412, 415, 661, 664. 

Williams, John Hayden, 541. 

Williams, Virginia Panayotova Vassilevska 
(Виргиния Панайотова Василевска), 
717. 

Williamson, Dorothy, 115. 

Wilson, Edwin Bidwell, 134. 

Winograd, Shmuel, 280, 316, 500, 501, 
507, 509, 512-514, 520, 523, 705-707, 
710, 712, 714. 

Wirsing, Eduard, 363, 366, 376. 

WM1 (word size minus one), 252, 267, 613. 

Wolf, Thomas Howard, 192. 


Wolff von Gudenberg, Jürgen Freiherr, 242. 


Wolfowitz, Jacob, 69, 74. 

Woltman, George Frederick, 409. 

Wood, William Wayne, 115. 

Word length: Logarithm of word size. 

Word size, 12-16, 265, 276. 

Wrench, John William, Jr., 280, 379, 
627, 728. 

Wright, Edward Maitland, 384, 653. 

Wunderlich, Charles Marvin, 390, 
394, 399-400. 

Wynn, Peter, 356, 613. 

Wynn-Williams, Charles Eryl, 202. 


XOR (bitwise exclusive-or), 31, 32, 193, 419. 


Yagati, see Lakshman. 

Yaglom, Akiva Moiseevich (Яглом, Акива 
Моисеевич), 622. 

Yaglom, Isaak Moiseevich (Яглом, Исаак 
Моисеевич), 622. 

Yao, Andrew Chi-Chih (姚期智), 138, 170, 
179, 316, 378, 484, 485, 540. 

Yao, Frances Foong Chu (姚儲楓), 484. 

Yates, Frank, 145, 173, 501-502. 

Yee, Alexander Jih-Hing (余智恒), 280. 

Yohe, James Michael, 612. 

Young, Jeffery Stagg, 664. 

Younis, Saed Ghalib, 311. 

Yuditsky, Davit Islam Gireevich (Юдицкий, 
Давит Ислам Гиреевич), 292. 

Yun, David Yuan-Yee, 454-455, 
460, 686, 688, 689, 721. 

Yuriev, Sergei Petrovich (Юрьев, Сергей 
Петрович), 366. 


Z-independent vectors, 524. 
Zacher, Hans-Joachim, 200. 
Zaman, Arif, 72, 75, 546, 
547, 549. 
Zantema, Hantsje, 696. 
Zaremba, Stanisław Krystyn, 108, 115, 
117, 332, 584. 
Zaring, Wilson Miles, 653. 
Zassenhaus, Hans Julius, 446, 448, 449, 
455, 456, 681, 685. 
Zeilberger, Doron, 536, 683. 
Zero, 196, 336. 
leading, 222, 238-240, 327. 
minus, 202, 244-245, 249, 268, 274. 
order of magnitude, 239. 
polynomial, 418. 
Zero divisors, 671. 
Zeta function, 362, 382, 414, 644. 
Zhang, Linbo, 764. 
Ziegler Hunts, Julian James, 617. 
Zierler, Neal, 29. 
Zippel, Richard Eliot, 455, 675. 
Zuckerman, Herbert Samuel, 155-156. 
Zuse, Konrad, 202, 225, 227. 
Zvonkin, Alexander Kalmanovich (Звонкин, 
Александр Калманович), 170. 
Character code: 

Code  00 01 02 03 04 05 06 07 08 09 10 11 12 13 
Char   ⊔  A  B  C  D  E  F  G  H  I  ´  J  K  L 

Code  14 15 16 17 18 19 20 21 22 23 24 25 26 27 
Char   M  N  O  P  Q  R  °  ″  S  T  U  V  W  X 

Code  28 29 30 31 32 33 34 35 36 37 38 39 40 41 
Char   Y  Z  0  1  2  3  4  5  6  7  8  9  .  , 

Code  42 43 44 45 46 47 48 49 50 51 52 53 54 55 
Char   (  )  +  -  *  /  =  $  <  >  @  ;  :  ' 


C   t     Description                    OP(F) 

00  1     No operation                   NOP(0) 
01  2     rA ← rA + V                    ADD(0:5), FADD(6) 
02  2     rA ← rA - V                    SUB(0:5), FSUB(6) 
03  10    rAX ← rA × V                   MUL(0:5), FMUL(6) 
04  12    rA ← rAX/V, rX ← remainder     DIV(0:5), FDIV(6) 
05  10    Special                        NUM(0), CHAR(1), HLT(2) 
06  2     Shift M bytes                  SLA(0), SRA(1), SLAX(2), SRAX(3), SLC(4), SRC(5) 
07  1+2F  Move F words from M to rI1     MOVE(1) 

08  2     rA ← V                         LDA(0:5) 
09  2     rI1 ← V                        LD1(0:5) 
10  2     rI2 ← V                        LD2(0:5) 
11  2     rI3 ← V                        LD3(0:5) 
12  2     rI4 ← V                        LD4(0:5) 
13  2     rI5 ← V                        LD5(0:5) 
14  2     rI6 ← V                        LD6(0:5) 
15  2     rX ← V                         LDX(0:5) 

16  2     rA ← -V                        LDAN(0:5) 
17  2     rI1 ← -V                       LD1N(0:5) 
18  2     rI2 ← -V                       LD2N(0:5) 
19  2     rI3 ← -V                       LD3N(0:5) 
20  2     rI4 ← -V                       LD4N(0:5) 
21  2     rI5 ← -V                       LD5N(0:5) 
22  2     rI6 ← -V                       LD6N(0:5) 
23  2     rX ← -V                        LDXN(0:5) 

24  2     M(F) ← rA                      STA(0:5) 
25  2     M(F) ← rI1                     ST1(0:5) 
26  2     M(F) ← rI2                     ST2(0:5) 
27  2     M(F) ← rI3                     ST3(0:5) 
28  2     M(F) ← rI4                     ST4(0:5) 
29  2     M(F) ← rI5                     ST5(0:5) 
30  2     M(F) ← rI6                     ST6(0:5) 
31  2     M(F) ← rX                      STX(0:5) 

32  2     M(F) ← rJ                      STJ(0:2) 
33  2     M(F) ← 0                       STZ(0:5) 
34  1     Unit F busy?                   JBUS(0) 
35  1+T   Control, unit F                IOC(0) 
36  1+T   Input, unit F                  IN(0) 
37  1+T   Output, unit F                 OUT(0) 
38  1     Unit F ready?                  JRED(0) 
39  1     Jumps                          JMP(0), JSJ(1), JOV(2), JNOV(3), also [*] below 

40  1     rA : 0, jump                   JA[+] 
41  1     rI1 : 0, jump                  J1[+] 
42  1     rI2 : 0, jump                  J2[+] 
43  1     rI3 : 0, jump                  J3[+] 
44  1     rI4 : 0, jump                  J4[+] 
45  1     rI5 : 0, jump                  J5[+] 
46  1     rI6 : 0, jump                  J6[+] 
47  1     rX : 0, jump                   JX[+] 

48  1     rA ← [rA]? ± M                 INCA(0), DECA(1), ENTA(2), ENNA(3) 
49  1     rI1 ← [rI1]? ± M               INC1(0), DEC1(1), ENT1(2), ENN1(3) 
50  1     rI2 ← [rI2]? ± M               INC2(0), DEC2(1), ENT2(2), ENN2(3) 
51  1     rI3 ← [rI3]? ± M               INC3(0), DEC3(1), ENT3(2), ENN3(3) 
52  1     rI4 ← [rI4]? ± M               INC4(0), DEC4(1), ENT4(2), ENN4(3) 
53  1     rI5 ← [rI5]? ± M               INC5(0), DEC5(1), ENT5(2), ENN5(3) 
54  1     rI6 ← [rI6]? ± M               INC6(0), DEC6(1), ENT6(2), ENN6(3) 
55  1     rX ← [rX]? ± M                 INCX(0), DECX(1), ENTX(2), ENNX(3) 

56  2     CI ← rA(F) : V                 CMPA(0:5), FCMP(6) 
57  2     CI ← rI1(F) : V                CMP1(0:5) 
58  2     CI ← rI2(F) : V                CMP2(0:5) 
59  2     CI ← rI3(F) : V                CMP3(0:5) 
60  2     CI ← rI4(F) : V                CMP4(0:5) 
61  2     CI ← rI5(F) : V                CMP5(0:5) 
62  2     CI ← rI6(F) : V                CMP6(0:5) 
63  2     CI ← rX(F) : V                 CMPX(0:5) 


General form: 

C   t     Description                    OP(F) 

C = operation code, (5 : 5) field of instruction 
F = op variant, (4 : 4) field of instruction 
M = address of instruction after indexing 
V = M(F) = contents of F field of location M 
OP = symbolic name for operation 
(F) = normal F setting 
t = execution time; T = interlock time 


rA = register A 
rX = register X 
rAX = registers A and X as one 
rIi = index register i, 1 ≤ i ≤ 6 
rJ = register J 
CI = comparison indicator 

[*]: JL(4) <, JE(5) =, JG(6) >, JGE(7) ≥, JNE(8) ≠, JLE(9) ≤ 
[+]: N(0), Z(1), P(2), NN(3), NZ(4), NP(5) 

